CrackedRuby

Overview

Database monitoring encompasses the continuous collection, analysis, and visualization of metrics related to database performance, resource utilization, query execution, and system health. This practice provides visibility into database operations, enabling developers and operations teams to identify bottlenecks, prevent outages, and maintain optimal performance across application lifecycles.

Modern applications depend on databases as critical infrastructure components. A database performance degradation directly impacts application response times, user experience, and system availability. Database monitoring transforms opaque database operations into observable, measurable processes that support data-driven optimization decisions.

The scope of database monitoring extends beyond simple uptime checks. Comprehensive monitoring tracks query execution times, connection pool utilization, transaction rates, lock contention, replication lag, cache hit ratios, disk I/O patterns, and error frequencies. Each metric reveals specific aspects of database behavior that influence application performance.

# Basic database monitoring setup
require 'pg'

conn = PG.connect(dbname: 'production')

# Count idle connections and the longest idle time
result = conn.exec(<<-SQL)
  SELECT count(*) as connection_count,
         max(extract(epoch from (now() - state_change))) as max_idle_time
  FROM pg_stat_activity
  WHERE state = 'idle'
SQL

puts "Idle connections: #{result[0]['connection_count']}"
puts "Max idle time: #{result[0]['max_idle_time']} seconds"

Database monitoring operates at multiple layers: infrastructure monitoring tracks CPU, memory, and disk utilization at the server level; database engine monitoring examines internal database metrics like buffer cache performance and checkpoint activity; application-level monitoring focuses on query patterns and connection behavior from the application perspective. Each layer provides distinct insights that complement the others.

Key Principles

Database monitoring relies on several fundamental concepts that define how metrics are collected, interpreted, and acted upon.

Metric Collection Frequency determines the granularity of monitoring data. High-frequency collection (every few seconds) captures transient issues like query spikes but generates substantial data volume. Low-frequency collection (every few minutes) reduces overhead but may miss short-lived problems. The collection interval must balance observability needs with monitoring system overhead.

Baseline Establishment involves capturing normal database behavior patterns during typical operating conditions. Baselines provide reference points for anomaly detection. A query averaging 50ms under normal load but suddenly taking 500ms indicates a performance regression. Without established baselines, distinguishing normal variation from genuine problems becomes difficult.
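
The baseline idea can be sketched in a few lines of plain Ruby (the `Baseline` class and its 3x deviation factor are illustrative, not a library API):

```ruby
# Illustrative baseline tracker: records per-query durations and flags
# observations that exceed the established mean by a configurable factor.
class Baseline
  def initialize(factor: 3.0)
    @samples = Hash.new { |h, k| h[k] = [] }
    @factor = factor
  end

  # Record an observed duration (in milliseconds) for a named query
  def record(query_name, duration_ms)
    @samples[query_name] << duration_ms
  end

  def mean(query_name)
    values = @samples[query_name]
    return nil if values.empty?
    values.sum.to_f / values.size
  end

  # Anomalous when the observation exceeds the baseline mean by the factor
  def anomaly?(query_name, duration_ms)
    baseline = mean(query_name)
    return false if baseline.nil?
    duration_ms > baseline * @factor
  end
end

baseline = Baseline.new(factor: 3.0)
20.times { baseline.record('users_by_email', 50.0) }
baseline.anomaly?('users_by_email', 55.0)  # => false: normal variation
baseline.anomaly?('users_by_email', 500.0) # => true: 10x regression
```

Without the twenty recorded samples, `anomaly?` has no reference point and returns false, which mirrors the text: no baseline, no reliable anomaly detection.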

Active vs Passive Monitoring represents two collection approaches. Passive monitoring observes existing database operations without generating additional load. Active monitoring executes synthetic queries to verify database responsiveness. Active checks detect availability issues immediately but add operational overhead.

# Passive monitoring - observing existing operations
def passive_monitor(conn)
  conn.exec(<<-SQL)
    SELECT query, 
           calls, 
           total_exec_time,
           mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10
  SQL
end

# Active monitoring - synthetic health check
def active_monitor(conn)
  start_time = Time.now
  conn.exec("SELECT 1")
  response_time = Time.now - start_time
  
  {
    status: response_time < 0.1 ? 'healthy' : 'degraded',
    response_time_ms: (response_time * 1000).round(2)
  }
end

Aggregation and Retention govern how detailed metrics are stored over time. Recent data maintains full granularity for immediate troubleshooting. Historical data aggregates into hourly or daily summaries to reduce storage requirements while preserving trend analysis capability. A common pattern retains second-level data for 24 hours, minute-level data for 7 days, and hourly data for months or years.
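
The rollup step can be sketched as a pure-Ruby function (`rollup` is a hypothetical helper, not part of any gem) that collapses raw samples into fixed-width time buckets:

```ruby
# Aggregate raw {timestamp:, value:} samples into per-bucket summaries,
# keeping count/min/max/mean so older data can be stored compactly.
def rollup(samples, bucket_seconds: 60)
  samples
    .group_by { |s| (s[:timestamp].to_i / bucket_seconds) * bucket_seconds }
    .map do |bucket_start, points|
      values = points.map { |p| p[:value] }
      {
        bucket_start: Time.at(bucket_start),
        count: values.size,
        min: values.min,
        max: values.max,
        mean: values.sum.to_f / values.size
      }
    end
    .sort_by { |b| b[:bucket_start] }
end

# Two minutes of second-level samples collapse into two summary rows
raw = (0...120).map { |s| { timestamp: Time.at(s), value: s % 10 } }
rollup(raw).size # => 2
```

Running the same function again over minute-level summaries with `bucket_seconds: 3600` would produce the hourly tier described above.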

Threshold-Based Alerting triggers notifications when metrics exceed predefined limits. Simple threshold alerts fire when a value crosses a boundary (connections > 90% of max). Complex alerts combine multiple conditions or detect sustained threshold violations. Alert fatigue occurs when thresholds trigger too frequently, reducing response effectiveness.
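
A sustained-violation check, one form of the complex alert mentioned above, can be sketched like this (the class name and three-check window are illustrative):

```ruby
# Fire only after the metric breaches the threshold for N consecutive
# checks, so brief spikes do not page anyone.
class SustainedThresholdAlert
  def initialize(threshold:, required_breaches: 3)
    @threshold = threshold
    @required = required_breaches
    @consecutive = 0
  end

  # Returns true when the violation has been sustained long enough to alert
  def check(value)
    if value > @threshold
      @consecutive += 1
    else
      @consecutive = 0
    end
    @consecutive >= @required
  end
end

alert = SustainedThresholdAlert.new(threshold: 90, required_breaches: 3)
alert.check(95) # => false: first breach
alert.check(95) # => false: second breach
alert.check(95) # => true: sustained violation
```

Requiring consecutive breaches trades a slightly slower alert for far fewer false positives, directly addressing the alert-fatigue problem.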

Percentile Metrics provide more insight than simple averages. The 95th percentile (p95) is the latency at or below which 95% of requests complete, filtering the outliers that skew averages. The 99th percentile (p99) captures the near-worst-case latency experienced by the slowest 1% of requests. Maximum values often represent edge cases or measurement errors rather than typical behavior.
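
A nearest-rank percentile calculation, with numbers showing how the mean hides a latency tail, might look like this (a minimal sketch, not a statistics library):

```ruby
# Nearest-rank percentile: the value at position ceil(p/100 * n) of the
# sorted data (1-indexed)
def percentile(values, p)
  sorted = values.sort
  index = (p / 100.0 * sorted.length).ceil - 1
  sorted[index]
end

latencies = [10] * 95 + [900] * 5   # 95 fast requests, 5 slow outliers

latencies.sum.to_f / latencies.size # => 54.5 (mean, skewed by outliers)
percentile(latencies, 95)           # => 10 (what most requests see)
percentile(latencies, 99)           # => 900 (the slow tail)
```

The mean of 54.5ms describes no actual request here; p95 and p99 separate the typical experience from the tail.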

Database-Specific Metrics vary by database engine. PostgreSQL exposes statistics through system views like pg_stat_activity and pg_stat_statements. MySQL provides performance_schema and information_schema views. Each database has unique internal metrics reflecting its architecture and optimization strategies.

Ruby Implementation

Ruby applications typically monitor databases through several approaches: direct database querying, gem-based instrumentation, and external monitoring tool integration.

Direct Querying Approach executes monitoring queries using database adapters:

require 'pg'

class PostgresMonitor
  def initialize(connection_string)
    @conn = PG.connect(connection_string)
  end
  
  def connection_stats
    result = @conn.exec(<<-SQL)
      SELECT 
        count(*) FILTER (WHERE state = 'active') as active,
        count(*) FILTER (WHERE state = 'idle') as idle,
        count(*) FILTER (WHERE state = 'idle in transaction') as idle_in_transaction,
        max(extract(epoch from (now() - state_change))) 
          FILTER (WHERE state = 'active') as longest_query_seconds
      FROM pg_stat_activity
      WHERE pid != pg_backend_pid()
    SQL
    
    result[0].transform_values { |v| v.to_f }
  end
  
  def slow_queries(threshold_ms = 1000)
    @conn.exec_params(<<-SQL, [threshold_ms])
      SELECT 
        query,
        calls,
        mean_exec_time,
        max_exec_time,
        total_exec_time
      FROM pg_stat_statements
      WHERE mean_exec_time > $1
      ORDER BY mean_exec_time DESC
      LIMIT 20
    SQL
  end
  
  def table_bloat
    @conn.exec(<<-SQL)
      SELECT 
        schemaname,
        tablename,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as total_size,
        pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as table_size,
        pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - 
                       pg_relation_size(schemaname||'.'||tablename)) as indexes_size
      FROM pg_tables
      WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
      ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
      LIMIT 10
    SQL
  end
end

ActiveRecord Integration for Rails applications provides hooks into query execution:

# config/initializers/database_monitoring.rb
ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
  duration = (finish - start) * 1000 # Convert to milliseconds
  
  # Skip schema queries and very fast queries
  next if payload[:name] == 'SCHEMA' || duration < 100
  
  # Log slow queries
  if duration > 1000
    Rails.logger.warn(
      "Slow Query (#{duration.round(2)}ms): #{payload[:sql]}"
    )
  end
  
  # Send metrics to monitoring system
  StatsD.increment('database.queries.total')
  StatsD.histogram('database.query.duration', duration)
  StatsD.increment("database.queries.#{payload[:name]}")
end

# Track connection pool metrics
ActiveSupport::Notifications.subscribe('checkout.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  StatsD.histogram('database.pool.checkout_time', event.duration)
end

Connection Pool Monitoring tracks pool utilization patterns:

class ConnectionPoolMonitor
  def self.monitor_pool
    pool = ActiveRecord::Base.connection_pool
    
    {
      size: pool.size,
      connections: pool.connections.size,
      in_use: pool.connections.count(&:in_use?),
      available: pool.connections.count { |c| !c.in_use? },
      waiting: pool.num_waiting_in_queue,
      pool_utilization: (pool.connections.count(&:in_use?).to_f / pool.size * 100).round(2)
    }
  end
  
  def self.check_pool_exhaustion
    metrics = monitor_pool
    
    if metrics[:pool_utilization] > 90
      alert_high_pool_usage(metrics)
    end
    
    if metrics[:waiting] > 0
      alert_connection_wait(metrics)
    end
  end
end

# Periodic monitoring
Thread.new do
  loop do
    metrics = ConnectionPoolMonitor.monitor_pool
    StatsD.gauge('database.pool.size', metrics[:size])
    StatsD.gauge('database.pool.in_use', metrics[:in_use])
    StatsD.gauge('database.pool.waiting', metrics[:waiting])
    
    sleep 10
  end
end

Custom Metric Collection for application-specific monitoring:

class DatabaseMetricsCollector
  def initialize(database_url)
    @conn = PG.connect(database_url)
    @metrics = {}
  end
  
  def collect_all
    collect_connection_metrics
    collect_query_performance_metrics
    collect_replication_metrics
    collect_lock_metrics
    @metrics
  end
  
  private
  
  def collect_connection_metrics
    stats = @conn.exec(<<-SQL).first
      SELECT 
        count(*) as total_connections,
        count(*) FILTER (WHERE state = 'active') as active_connections,
        count(*) FILTER (WHERE wait_event_type IS NOT NULL) as waiting_connections
      FROM pg_stat_activity
    SQL
    
    @metrics[:connections] = stats.transform_values(&:to_i)
  end
  
  def collect_query_performance_metrics
    cache_stats = @conn.exec(<<-SQL).first
      SELECT 
        sum(blks_hit) as buffer_hits,
        sum(blks_read) as disk_reads,
        round(sum(blks_hit)::numeric / nullif(sum(blks_hit) + sum(blks_read), 0) * 100, 2) as cache_hit_ratio
      FROM pg_stat_database
      WHERE datname = current_database()
    SQL
    
    @metrics[:cache] = {
      hit_ratio: cache_stats['cache_hit_ratio'].to_f,
      buffer_hits: cache_stats['buffer_hits'].to_i,
      disk_reads: cache_stats['disk_reads'].to_i
    }
  end
  
  def collect_replication_metrics
    replication = @conn.exec(<<-SQL)
      SELECT 
        client_addr,
        state,
        sync_state,
        pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag_bytes,
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
      FROM pg_stat_replication
    SQL
    
    @metrics[:replication] = replication.map do |row|
      {
        replica: row['client_addr'],
        state: row['state'],
        sync_state: row['sync_state'],
        lag_bytes: row['replay_lag_bytes'].to_i
      }
    end
  end
  
  def collect_lock_metrics
    locks = @conn.exec(<<-SQL).first
      SELECT 
        count(*) FILTER (WHERE mode LIKE '%ExclusiveLock') as exclusive_locks,
        count(*) FILTER (WHERE NOT granted) as waiting_locks
      FROM pg_locks
    SQL
    
    @metrics[:locks] = locks.transform_values(&:to_i)
  end
end

Implementation Approaches

Database monitoring implementations vary based on deployment architecture, database technology, and organizational requirements.

Embedded Monitoring integrates collection logic directly within the application process. This approach provides immediate access to application context and requires no external services, but monitoring code executes within application memory and CPU budgets. Embedded monitoring suits small to medium deployments where monitoring overhead remains manageable.

# Embedded monitoring with periodic reporting
class EmbeddedMonitor
  def initialize(report_interval: 60)
    @report_interval = report_interval
    @query_times = []
    @mutex = Mutex.new
  end
  
  def record_query(duration, query_type)
    @mutex.synchronize do
      @query_times << {
        duration: duration,
        type: query_type,
        timestamp: Time.now
      }
    end
  end
  
  def start_reporting
    Thread.new do
      loop do
        sleep @report_interval
        report_metrics
      end
    end
  end
  
  private
  
  def report_metrics
    @mutex.synchronize do
      return if @query_times.empty?
      
      durations = @query_times.map { |q| q[:duration] }
      
      metrics = {
        count: durations.size,
        mean: durations.sum / durations.size,
        p95: percentile(durations, 95),
        p99: percentile(durations, 99),
        max: durations.max
      }
      
      # Send to external system or log
      Rails.logger.info("Database Metrics: #{metrics}")
      
      @query_times.clear
    end
  end
  
  def percentile(values, percentile)
    sorted = values.sort
    index = (percentile / 100.0 * sorted.length).ceil - 1
    sorted[index]
  end
end

Sidecar Monitoring deploys dedicated monitoring processes alongside applications. Sidecars collect metrics through inter-process communication or by polling the database independently. This approach isolates monitoring overhead from application processes and supports multiple applications monitoring the same database, though it requires additional deployment complexity.

Agent-Based Monitoring installs monitoring agents on database servers that export metrics for collection by centralized systems. Database agents have privileged access to server-level metrics like disk I/O, memory usage, and internal database statistics unavailable through standard SQL queries. Agent-based approaches excel at infrastructure monitoring but require server access and administrative permissions.

Query Log Analysis examines database query logs to extract performance patterns. Log analysis operates without active database queries, analyzing historical execution patterns. This approach provides comprehensive query coverage but introduces latency between query execution and metric availability.

# Log parser for PostgreSQL CSV logs
class PostgresLogParser
  def initialize(log_file_path)
    @log_file_path = log_file_path
  end
  
  def parse_slow_queries(threshold_ms = 1000)
    slow_queries = []
    
    File.foreach(@log_file_path) do |line|
      next unless line.include?('duration:')
      
      if match = line.match(/duration: ([\d.]+) ms.*statement: (.+)$/)
        duration = match[1].to_f
        query = match[2]
        
        if duration >= threshold_ms
          slow_queries << {
            duration: duration,
            query: normalize_query(query),
            timestamp: extract_timestamp(line)
          }
        end
      end
    end
    
    aggregate_queries(slow_queries)
  end
  
  private
  
  def normalize_query(query)
    # Replace specific values with placeholders for aggregation
    query.gsub(/\d+/, 'N')
         .gsub(/'[^']+'/, "'?'")
         .gsub(/\s+/, ' ')
         .strip
  end
  
  def aggregate_queries(queries)
    queries.group_by { |q| q[:query] }.map do |normalized_query, instances|
      {
        query: normalized_query,
        count: instances.size,
        total_duration: instances.sum { |i| i[:duration] },
        avg_duration: instances.sum { |i| i[:duration] } / instances.size,
        max_duration: instances.map { |i| i[:duration] }.max
      }
    end.sort_by { |q| -q[:total_duration] }
  end
end

Sampling-Based Monitoring collects metrics from a subset of operations rather than every transaction. Sampling reduces monitoring overhead while maintaining statistical validity for common patterns. A 1% sample rate captures sufficient data for trend analysis while minimizing performance impact, though rare events may go undetected.
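
A probabilistic sampler along these lines can be sketched in plain Ruby (the `Sampler` class is illustrative; real systems often sample deterministically by request or trace ID):

```ruby
# Record each operation with probability `rate`; scale sampled counts
# back up when estimating totals.
class Sampler
  attr_reader :sampled

  def initialize(rate: 0.01, rng: Random.new)
    @rate = rate
    @rng = rng
    @sampled = []
  end

  def record(duration_ms)
    @sampled << duration_ms if @rng.rand < @rate
  end

  # Estimate of the true operation count from the sampled subset
  def estimated_count
    (@sampled.size / @rate).round
  end
end

sampler = Sampler.new(rate: 0.01)
100_000.times { sampler.record(5.0) }
sampler.estimated_count # roughly 100_000, reconstructed from ~1_000 stored samples
```

Aggregates like counts and percentiles remain statistically sound under sampling, but a query that runs once a day has only a 1% chance of ever being recorded, which is the blind spot the text warns about.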

Continuous Aggregation pre-computes summary statistics within the database using materialized views or background processes. Applications query pre-aggregated metrics rather than computing them on-demand, reducing query overhead at the cost of slightly stale data.

Tools & Ecosystem

The Ruby ecosystem provides multiple tools and gems for database monitoring, each addressing different aspects of observability.

pg_query parses PostgreSQL queries for analysis:

require 'pg_query'

class QueryAnalyzer
  def analyze_query(sql)
    parsed = PgQuery.parse(sql)
    
    {
      tables: extract_tables(parsed),
      query_type: determine_query_type(parsed),
      has_joins: contains_joins?(parsed),
      estimated_complexity: estimate_complexity(parsed)
    }
  end
  
  private
  
  def extract_tables(parsed)
    parsed.tables
  end
  
  def determine_query_type(parsed)
    # pg_query 2.x wraps each statement in a PgQuery::Node;
    # Node#node returns a symbol naming the concrete statement type
    case parsed.tree.stmts.first.stmt.node
    when :select_stmt then 'SELECT'
    when :insert_stmt then 'INSERT'
    when :update_stmt then 'UPDATE'
    when :delete_stmt then 'DELETE'
    else 'OTHER'
    end
  end
end

marginalia annotates queries with application context for monitoring:

# Gemfile
gem 'marginalia'

# config/initializers/marginalia.rb
Marginalia::Comment.components = [:application, :controller, :action, :job]

# Queries now include comments:
# SELECT * FROM users /* application:my_app,controller:users,action:index */

Query annotations enable correlating database load with specific application endpoints or background jobs, simplifying performance troubleshooting.

ruby-pg-extras provides convenient access to PostgreSQL statistics:

require 'ruby_pg_extras'

# Connection counts
RubyPgExtras.connections

# Long-running queries
RubyPgExtras.long_running_queries

# Unused indexes
RubyPgExtras.unused_indexes

# Table sizes
RubyPgExtras.table_size

database_cleaner ensures clean test environments but also enables monitoring test database behavior:

require 'database_cleaner'

DatabaseCleaner.strategy = :transaction

# Monitor test database overhead
test_start = Time.now
DatabaseCleaner.cleaning do
  # Run tests
end
test_duration = Time.now - test_start

puts "Test suite database time: #{test_duration}s"

Rack::MiniProfiler profiles database queries in development and staging:

# Gemfile
gem 'rack-mini-profiler'

# config/environments/development.rb
config.middleware.use Rack::MiniProfiler

# Shows timing breakdown for all queries on each request

PrometheusExporter exports application and database metrics:

require 'prometheus_exporter/client'

class DatabasePrometheusExporter
  def initialize
    @client = PrometheusExporter::Client.default
  end
  
  def export_pool_metrics
    pool = ActiveRecord::Base.connection_pool
    
    @client.send_json(
      type: 'database_pool',
      metric_labels: { pool: 'primary' },
      size: pool.size,
      in_use: pool.connections.count(&:in_use?),
      waiting: pool.num_waiting_in_queue
    )
  end
  
  def export_query_metrics(duration, query_type)
    @client.send_json(
      type: 'database_query',
      metric_labels: { type: query_type },
      duration: duration
    )
  end
end

Scout APM and New Relic provide commercial monitoring solutions with Ruby gems:

# Gemfile
gem 'scout_apm'

# config/initializers/scout_apm.rb
ScoutApm::Config.instance.tap do |config|
  config.database_queries = true
  config.slow_query_threshold = 500 # milliseconds
end

These services offer comprehensive monitoring including query analysis, endpoint profiling, and alerting without requiring infrastructure setup.

Integration & Interoperability

Database monitoring integrates with broader observability systems through standardized metrics, logging, and tracing protocols.

Metrics Export typically uses formats like Prometheus, StatsD, or CloudWatch:

require 'statsd-instrument'

# statsd-instrument 3.x reads its destination from the environment,
# e.g. STATSD_ADDR=localhost:8125 and STATSD_IMPLEMENTATION=statsd
class DatabaseMetricsExporter
  
  def export_connection_metrics
    pool = ActiveRecord::Base.connection_pool
    
    StatsD.gauge('db.pool.size', pool.size)
    StatsD.gauge('db.pool.in_use', pool.connections.count(&:in_use?))
    StatsD.gauge('db.pool.available', pool.connections.count { |c| !c.in_use? })
    StatsD.gauge('db.pool.waiting', pool.num_waiting_in_queue)
  end
  
  def export_query_metrics(duration, query_name)
    StatsD.histogram('db.query.duration', duration, tags: ["query:#{query_name}"])
    StatsD.increment('db.query.count', tags: ["query:#{query_name}"])
  end
end

# Integration with ActiveSupport::Notifications.
# Create the exporter once; instantiating it per query adds needless overhead.
exporter = DatabaseMetricsExporter.new

ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)

  exporter.export_query_metrics(
    event.duration,
    event.payload[:name] || 'unknown'
  )
end

Structured Logging provides query-level visibility:

class DatabaseLogger
  def initialize(logger)
    @logger = logger
  end
  
  def log_query(sql, duration, rows_affected)
    @logger.info(
      event: 'database_query',
      sql: sanitize_sql(sql),
      duration_ms: duration,
      rows: rows_affected,
      timestamp: Time.now.iso8601
    )
  end
  
  private
  
  def sanitize_sql(sql)
    # Remove sensitive data from logged queries
    sql.gsub(/password\s*=\s*'[^']+'/, "password='[REDACTED]'")
       .gsub(/\d{16}/, '[CARD_NUMBER]')
  end
end

# Integration point - reuse one logger rather than allocating per query
db_logger = DatabaseLogger.new(Rails.logger)

ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
  duration = (finish - start) * 1000

  db_logger.log_query(
    payload[:sql],
    duration,
    payload[:row_count] || 0
  )
end

Distributed Tracing correlates database operations across service boundaries:

require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/active_record'

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'my_application'
  c.use 'OpenTelemetry::Instrumentation::ActiveRecord'
end

# Database queries automatically appear in distributed traces
# with parent request context

Health Check Endpoints expose database status for load balancers and orchestration:

# config/routes.rb
get '/health/database', to: 'health#database'

# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def database
    checks = {
      connection: check_connection,
      pool: check_pool_health,
      replication: check_replication_lag
    }
    
    status = checks.values.all? { |c| c[:healthy] } ? 200 : 503
    
    render json: {
      status: status == 200 ? 'healthy' : 'unhealthy',
      checks: checks,
      timestamp: Time.now.iso8601
    }, status: status
  end
  
  private
  
  def check_connection
    ActiveRecord::Base.connection.execute('SELECT 1')
    { healthy: true, message: 'Connection successful' }
  rescue => e
    { healthy: false, message: e.message }
  end
  
  def check_pool_health
    pool = ActiveRecord::Base.connection_pool
    utilization = (pool.connections.count(&:in_use?).to_f / pool.size * 100).round(2)
    
    {
      healthy: utilization < 90,
      utilization: utilization,
      message: "Pool #{utilization}% utilized"
    }
  end
  
  def check_replication_lag
    # pg_last_xact_replay_timestamp() is NULL on a primary;
    # to_f maps that to zero lag
    lag = ActiveRecord::Base.connection.execute(<<-SQL).first
      SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp())) as lag
    SQL
    
    lag_seconds = lag['lag'].to_f
    
    {
      healthy: lag_seconds < 30,
      lag_seconds: lag_seconds,
      message: "Replication lag: #{lag_seconds}s"
    }
  end
end

Alert Manager Integration sends notifications based on metric thresholds:

class DatabaseAlertManager
  def initialize(alerting_service)
    @alerting_service = alerting_service
  end
  
  def check_and_alert
    metrics = collect_metrics
    
    evaluate_alert_rules(metrics).each do |alert|
      @alerting_service.send_alert(alert)
    end
  end
  
  private
  
  def evaluate_alert_rules(metrics)
    alerts = []
    
    if metrics[:pool_utilization] > 90
      alerts << {
        severity: 'critical',
        title: 'Database Pool Exhaustion',
        description: "Connection pool at #{metrics[:pool_utilization]}%",
        timestamp: Time.now
      }
    end
    
    if metrics[:replication_lag] > 300
      alerts << {
        severity: 'warning',
        title: 'High Replication Lag',
        description: "Replica lagging by #{metrics[:replication_lag]} seconds",
        timestamp: Time.now
      }
    end
    
    alerts
  end
end

Performance Considerations

Database monitoring introduces overhead that must be balanced against observability benefits. Each monitoring approach carries distinct performance implications.

Query Overhead from monitoring queries competes with application queries for database resources. Statistics queries that aggregate across large tables or indexes can consume significant CPU and I/O. Monitoring queries should use indexes, avoid full table scans, and execute during periods of lower load when possible.

# Inefficient monitoring query - full table scan
def count_users_inefficient
  User.count # Can be slow on large tables
end

# Efficient monitoring query - uses statistics
def count_users_efficient
  ActiveRecord::Base.connection.execute(<<-SQL).first
    SELECT reltuples::bigint as estimate
    FROM pg_class
    WHERE relname = 'users'
  SQL
end

Collection Frequency directly impacts monitoring overhead. Collecting metrics every second provides high resolution but generates 60x more overhead than collecting every minute. High-frequency collection suits detecting transient issues while lower frequencies suffice for trend analysis.

class AdaptiveMonitoring
  def initialize
    @baseline_response_time = nil
    @collection_interval = 60 # Start with 60 second intervals
  end
  
  def monitor
    loop do
      metrics = collect_metrics
      # First sample establishes the baseline
      @baseline_response_time ||= metrics[:response_time]

      # Collect more frequently while performance is degraded
      @collection_interval = performance_degraded?(metrics) ? 10 : 60

      sleep @collection_interval
    end
  end
  
  private
  
  def performance_degraded?(metrics)
    return false unless @baseline_response_time
    
    metrics[:response_time] > @baseline_response_time * 1.5
  end
end

Memory Usage from metric storage grows with collection frequency and retention duration. Storing second-level metrics for a week requires substantially more memory than minute-level metrics. Applications should aggregate historical data progressively, maintaining full resolution for recent data while summarizing older metrics.

Connection Pool Impact occurs when monitoring consumes connections from the application pool. Dedicated monitoring connections prevent this, but require larger connection pools. The tradeoff between pool size and monitoring isolation depends on connection limits and application concurrency needs.

# Separate connection pool for monitoring
class MonitoringDatabase < ActiveRecord::Base
  establish_connection(
    adapter: 'postgresql',
    database: ENV['DATABASE_NAME'],
    pool: 2, # Small dedicated pool
    checkout_timeout: 5
  )
end

# Monitoring queries use separate pool
def monitor_database
  MonitoringDatabase.connection.execute(<<-SQL)
    SELECT * FROM pg_stat_activity
  SQL
end

Sampling Strategies reduce overhead by monitoring a subset of operations. A 1% sample rate decreases monitoring overhead by 99% while maintaining statistical validity for common patterns. However, sampling may miss rare but critical events like infrequent slow queries.

Buffering and Batching defer metric transmission to reduce network overhead and database load. Collecting metrics locally and transmitting in batches reduces the number of monitoring transactions at the cost of slightly delayed visibility.

class BatchedMetricsCollector
  def initialize(batch_size: 100, flush_interval: 30)
    @batch_size = batch_size
    @flush_interval = flush_interval
    @buffer = []
    @mutex = Mutex.new

    start_flush_timer
  end

  def record_metric(metric)
    batch = @mutex.synchronize do
      @buffer << metric
      drain_buffer if @buffer.size >= @batch_size
    end
    # Transmit outside the lock so network I/O never blocks other recorders
    send_to_monitoring_system(batch) if batch
  end

  def flush
    batch = @mutex.synchronize { drain_buffer }
    send_to_monitoring_system(batch) if batch
  end

  private

  def start_flush_timer
    Thread.new do
      loop do
        sleep @flush_interval
        flush
      end
    end
  end

  # Caller must hold @mutex; returns nil when there is nothing to send
  def drain_buffer
    return nil if @buffer.empty?

    batch = @buffer.dup
    @buffer.clear
    batch
  end
end

Index Impact on monitoring queries determines whether they execute efficiently or cause table scans. Monitoring queries that join multiple system tables or filter by non-indexed columns can slow database operations. Creating indexes specifically for monitoring queries may be necessary in high-scale environments.

Reference

Core Monitoring Metrics

Metric | Description | Target Range | Alert Threshold
Connection Count | Active database connections | Varies by workload | >80% of max
Pool Utilization | Percentage of pool in use | <70% typical | >90%
Query Duration P95 | 95th percentile query time | <100ms | >1000ms
Query Duration P99 | 99th percentile query time | <500ms | >5000ms
Cache Hit Ratio | Buffer cache hit percentage | >95% | <90%
Locks Waiting | Queries waiting for locks | 0 | >5
Replication Lag | Seconds behind primary | <1s | >30s
Checkpoint Write Time | Time spent writing checkpoints | <30s | >60s
Deadlocks | Deadlock occurrences | 0 | >0
Rollback Rate | Percentage of rolled back transactions | <1% | >5%

PostgreSQL System Views

View | Purpose | Key Columns
pg_stat_activity | Current database activity | state, query, state_change
pg_stat_statements | Query execution statistics | query, calls, total_exec_time
pg_stat_database | Database-wide statistics | blks_hit, blks_read, conflicts
pg_stat_replication | Replication status | state, sync_state, replay_lag
pg_locks | Current lock status | mode, granted, relation
pg_stat_user_tables | Table access patterns | seq_scan, idx_scan, n_tup_upd
pg_stat_user_indexes | Index usage statistics | idx_scan, idx_tup_read
pg_statio_user_tables | Table I/O statistics | heap_blks_read, heap_blks_hit

Common Monitoring Queries

Query Type | Use Case | Execution Frequency
Connection stats | Track active connections | Every 10-30 seconds
Slow query detection | Identify performance issues | Every 1-5 minutes
Cache hit ratio | Assess buffer efficiency | Every 1 minute
Lock contention | Detect blocking queries | Every 30 seconds
Replication lag | Verify replica health | Every 10 seconds
Table bloat | Plan maintenance | Daily
Index usage | Optimize indexes | Weekly
Vacuum progress | Monitor maintenance | During vacuum

Monitoring Collection Methods

Method | Overhead | Latency | Use Case
Direct SQL queries | Low-Medium | Immediate | Real-time monitoring
Query log analysis | Low | Minutes-Hours | Historical analysis
Statistics views | Very Low | Immediate | Lightweight monitoring
External agents | Medium | Seconds | Infrastructure monitoring
Application instrumentation | Low | Immediate | Request correlation
Sampling | Very Low | Immediate | High-volume systems

Alert Severity Levels

Level | Criteria | Response Time | Example
Critical | Service degraded or down | Immediate | Connection failure
Warning | Performance degradation | 15 minutes | High query latency
Info | Threshold approaching | 1 hour | Pool at 75%
Debug | Detailed diagnostics | Non-urgent | Slow query logged

Ruby Monitoring Gems

Gem | Primary Function | Best For
ruby-pg-extras | PostgreSQL statistics helpers | Quick diagnostics
marginalia | Query annotation | Request correlation
rack-mini-profiler | Development profiling | Local optimization
prometheus_exporter | Metrics export | Production monitoring
scout_apm | Commercial APM | Full-stack observability
newrelic_rpm | Commercial APM | Enterprise monitoring
pg_query | Query parsing | Query analysis
database_cleaner | Test isolation | Test environment

Performance Impact Guidelines

Monitoring Approach | CPU Impact | Memory Impact | Network Impact
Passive observation | <1% | Low | None
Active health checks | 1-3% | Low | Low
Full query logging | 5-15% | Medium-High | Medium
High-frequency polling | 3-10% | Medium | Low-Medium
Distributed tracing | 2-5% | Medium | Medium
Real-time streaming | 5-10% | High | High