Overview
Database monitoring encompasses the continuous collection, analysis, and visualization of metrics related to database performance, resource utilization, query execution, and system health. This practice provides visibility into database operations, enabling developers and operations teams to identify bottlenecks, prevent outages, and maintain optimal performance across application lifecycles.
Modern applications depend on databases as critical infrastructure components. A database performance degradation directly impacts application response times, user experience, and system availability. Database monitoring transforms opaque database operations into observable, measurable processes that support data-driven optimization decisions.
The scope of database monitoring extends beyond simple uptime checks. Comprehensive monitoring tracks query execution times, connection pool utilization, transaction rates, lock contention, replication lag, cache hit ratios, disk I/O patterns, and error frequencies. Each metric reveals specific aspects of database behavior that influence application performance.
# Basic database monitoring setup
require 'pg'
conn = PG.connect(dbname: 'production')
# Query current connection count
result = conn.exec(<<-SQL)
SELECT count(*) as connection_count,
max(extract(epoch from (now() - state_change))) as max_idle_time
FROM pg_stat_activity
WHERE state = 'idle'
SQL
puts "Idle connections: #{result[0]['connection_count']}"
puts "Max idle time: #{result[0]['max_idle_time']} seconds"
Database monitoring operates at multiple layers: infrastructure monitoring tracks CPU, memory, and disk utilization at the server level; database engine monitoring examines internal database metrics like buffer cache performance and checkpoint activity; application-level monitoring focuses on query patterns and connection behavior from the application perspective. Each layer provides distinct insights that complement the others.
Key Principles
Database monitoring relies on several fundamental concepts that define how metrics are collected, interpreted, and acted upon.
Metric Collection Frequency determines the granularity of monitoring data. High-frequency collection (every few seconds) captures transient issues like query spikes but generates substantial data volume. Low-frequency collection (every few minutes) reduces overhead but may miss short-lived problems. The collection interval must balance observability needs with monitoring system overhead.
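The volume side of this trade-off can be made concrete with quick arithmetic. This is a sketch; the helper names and the 64-byte-per-sample figure are illustrative assumptions, not from any monitoring library:

```ruby
# Rough data-volume math behind choosing a collection interval.
SECONDS_PER_DAY = 86_400

def samples_per_day(interval_seconds)
  SECONDS_PER_DAY / interval_seconds
end

def bytes_per_day(interval_seconds, metrics_count, bytes_per_sample: 64)
  samples_per_day(interval_seconds) * metrics_count * bytes_per_sample
end

# 50 metrics collected every 5 seconds vs every 60 seconds:
high_freq = bytes_per_day(5, 50)
low_freq  = bytes_per_day(60, 50)
puts "5s interval:  #{high_freq / 1024 / 1024} MiB/day"  # ~52 MiB/day
puts "60s interval: #{low_freq / 1024 / 1024} MiB/day"   # ~4 MiB/day
```

A 12x shorter interval means 12x the storage and 12x the collection queries, which is why many systems reserve high-frequency collection for a small set of critical metrics.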
Baseline Establishment involves capturing normal database behavior patterns during typical operating conditions. Baselines provide reference points for anomaly detection. A query averaging 50ms under normal load but suddenly taking 500ms indicates a performance regression. Without established baselines, distinguishing normal variation from genuine problems becomes difficult.
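A minimal baseline sketch might learn a rolling mean from recent samples and flag values that deviate beyond a multiplier. The class name, window size, and 3x multiplier here are illustrative assumptions:

```ruby
# Learn a baseline from recent observations and flag large deviations.
class QueryBaseline
  def initialize(window: 100, multiplier: 3.0)
    @window = window          # how many recent samples define "normal"
    @multiplier = multiplier  # how far beyond the mean counts as anomalous
    @samples = []
  end

  def record(duration_ms)
    @samples << duration_ms
    @samples.shift if @samples.size > @window
  end

  def baseline
    return nil if @samples.empty?
    @samples.sum.to_f / @samples.size
  end

  # A sample is anomalous once it exceeds multiplier x the learned mean
  def anomalous?(duration_ms)
    mean = baseline
    return false if mean.nil?
    duration_ms > mean * @multiplier
  end
end

b = QueryBaseline.new
100.times { b.record(50.0) }  # steady state: ~50ms queries
puts b.anomalous?(55.0)       # within normal variation
puts b.anomalous?(500.0)      # 10x baseline: regression
```

Production anomaly detection usually also accounts for variance and time-of-day seasonality, but the core idea is the same: compare current behavior to learned behavior, not to a fixed guess.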
Active vs Passive Monitoring represents two collection approaches. Passive monitoring observes existing database operations without generating additional load. Active monitoring executes synthetic queries to verify database responsiveness. Active checks detect availability issues immediately but add operational overhead.
# Passive monitoring - observing existing operations
def passive_monitor(conn)
conn.exec(<<-SQL)
SELECT query,
calls,
total_exec_time,
mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10
SQL
end
# Active monitoring - synthetic health check
def active_monitor(conn)
start_time = Time.now
conn.exec("SELECT 1")
response_time = Time.now - start_time
{
status: response_time < 0.1 ? 'healthy' : 'degraded',
response_time_ms: (response_time * 1000).round(2)
}
end
Aggregation and Retention govern how detailed metrics are stored over time. Recent data maintains full granularity for immediate troubleshooting. Historical data aggregates into hourly or daily summaries to reduce storage requirements while preserving trend analysis capability. A common pattern retains second-level data for 24 hours, minute-level data for 7 days, and hourly data for months or years.
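The rollup step of such a retention scheme can be sketched in a few lines. The data shapes here (hashes with `:timestamp` and `:value`) are illustrative assumptions:

```ruby
# Roll second-level samples up into minute-level summary buckets,
# preserving count/min/max/avg while discarding per-second detail.
def rollup_to_minutes(samples)
  samples.group_by { |s| Time.at((s[:timestamp].to_i / 60) * 60) }
         .map do |minute, group|
    values = group.map { |s| s[:value] }
    {
      timestamp: minute,
      count: values.size,
      min: values.min,
      max: values.max,
      avg: values.sum.to_f / values.size
    }
  end
end

t = Time.utc(2024, 1, 1, 12, 0, 0)
samples = (0...120).map { |i| { timestamp: t + i, value: i % 10 } }
rolled = rollup_to_minutes(samples)
puts rolled.size           # 120 seconds of data -> 2 minute buckets
puts rolled.first[:count]  # 60 samples per bucket
```

The same function applied again (minutes into hours) implements the tiered retention pattern described above.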
Threshold-Based Alerting triggers notifications when metrics exceed predefined limits. Simple threshold alerts fire when a value crosses a boundary (connections > 90% of max). Complex alerts combine multiple conditions or detect sustained threshold violations. Alert fatigue occurs when thresholds trigger too frequently, reducing response effectiveness.
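One common defense against alert fatigue is requiring a sustained violation before firing. A sketch, with illustrative names and a three-check requirement chosen arbitrarily:

```ruby
# Fire only after the metric stays above the limit for N consecutive
# checks, suppressing one-off spikes.
class SustainedThresholdAlert
  def initialize(threshold:, required_breaches: 3)
    @threshold = threshold
    @required = required_breaches
    @consecutive = 0
  end

  # Returns true when the alert should fire for this observation
  def observe(value)
    if value > @threshold
      @consecutive += 1
    else
      @consecutive = 0  # any healthy reading resets the streak
    end
    @consecutive >= @required
  end
end

alert = SustainedThresholdAlert.new(threshold: 90)
puts alert.observe(95)  # 1st breach: no alert yet
puts alert.observe(92)  # 2nd breach: no alert yet
puts alert.observe(97)  # 3rd consecutive breach: fire
```

The cost of this damping is detection latency: a genuine incident is reported N collection intervals after it starts.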
Percentile Metrics provide more insight than simple averages. The 95th percentile (p95) is the latency at or below which 95% of requests complete, filtering out the outliers that skew averages. The 99th percentile (p99) captures the slowest 1% of requests, the near-worst-case behavior that still occurs regularly at scale. Maximum values often represent edge cases or measurement errors rather than typical behavior.
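The difference is easy to demonstrate: one outlier drags the mean while p95 stays representative. The percentile helper below uses the nearest-rank method and is a sketch, not a library function:

```ruby
# Nearest-rank percentile: the value at or below which pct% of samples fall.
def percentile(values, pct)
  sorted = values.sort
  index = (pct / 100.0 * sorted.length).ceil - 1
  sorted[index]
end

latencies = Array.new(99, 20) + [5_000]  # 99 fast requests, 1 outlier
mean = latencies.sum.to_f / latencies.size
puts "mean: #{mean} ms"                        # ~69.8ms, skewed by one request
puts "p95:  #{percentile(latencies, 95)} ms"   # 20ms
puts "p99:  #{percentile(latencies, 99)} ms"   # 20ms
puts "max:  #{latencies.max} ms"               # 5000ms
```

An alert on the mean here would fire despite 99% of users seeing 20ms responses; an alert on p95 would not.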
Database-Specific Metrics vary by database engine. PostgreSQL exposes statistics through system views like pg_stat_activity and pg_stat_statements. MySQL provides performance_schema and information_schema views. Each database has unique internal metrics reflecting its architecture and optimization strategies.
Ruby Implementation
Ruby applications typically monitor databases through several approaches: direct database querying, gem-based instrumentation, and external monitoring tool integration.
Direct Querying Approach executes monitoring queries using database adapters:
require 'pg'
class PostgresMonitor
def initialize(connection_string)
@conn = PG.connect(connection_string)
end
def connection_stats
result = @conn.exec(<<-SQL)
SELECT
count(*) FILTER (WHERE state = 'active') as active,
count(*) FILTER (WHERE state = 'idle') as idle,
count(*) FILTER (WHERE state = 'idle in transaction') as idle_in_transaction,
max(extract(epoch from (now() - state_change)))
FILTER (WHERE state = 'active') as longest_query_seconds
FROM pg_stat_activity
WHERE pid != pg_backend_pid()
SQL
result[0].transform_values { |v| v.to_f }
end
def slow_queries(threshold_ms = 1000)
@conn.exec_params(<<-SQL, [threshold_ms])
SELECT
query,
calls,
mean_exec_time,
max_exec_time,
total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > $1
ORDER BY mean_exec_time DESC
LIMIT 20
SQL
end
def table_bloat
@conn.exec(<<-SQL)
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as total_size,
pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) as table_size,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) -
pg_relation_size(schemaname||'.'||tablename)) as indexes_size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10
SQL
end
end
ActiveRecord Integration for Rails applications provides hooks into query execution:
# config/initializers/database_monitoring.rb
ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
duration = (finish - start) * 1000 # Convert to milliseconds
# Skip schema reflection queries and anything under 100ms; note that the
# counters below therefore only cover slower queries
next if payload[:name] == 'SCHEMA' || duration < 100
# Log slow queries
if duration > 1000
Rails.logger.warn(
"Slow Query (#{duration.round(2)}ms): #{payload[:sql]}"
)
end
# Send metrics to monitoring system
StatsD.increment('database.queries.total')
StatsD.histogram('database.query.duration', duration)
StatsD.increment("database.queries.#{payload[:name]}")
end
# Track connection pool checkout time (this event is not instrumented in
# every Rails version; confirm it exists for your release)
ActiveSupport::Notifications.subscribe('checkout.active_record') do |*args|
event = ActiveSupport::Notifications::Event.new(*args)
StatsD.histogram('database.pool.checkout_time', event.duration)
end
Connection Pool Monitoring tracks pool utilization patterns:
class ConnectionPoolMonitor
def self.monitor_pool
pool = ActiveRecord::Base.connection_pool
{
size: pool.size,
connections: pool.connections.size,
in_use: pool.connections.count(&:in_use?),
available: pool.connections.count { |c| !c.in_use? },
waiting: pool.num_waiting_in_queue,
pool_utilization: (pool.connections.count(&:in_use?).to_f / pool.size * 100).round(2)
}
end
def self.check_pool_exhaustion
metrics = monitor_pool
if metrics[:pool_utilization] > 90
alert_high_pool_usage(metrics)
end
if metrics[:waiting] > 0
alert_connection_wait(metrics)
end
end
end
# Periodic monitoring
Thread.new do
loop do
metrics = ConnectionPoolMonitor.monitor_pool
StatsD.gauge('database.pool.size', metrics[:size])
StatsD.gauge('database.pool.in_use', metrics[:in_use])
StatsD.gauge('database.pool.waiting', metrics[:waiting])
sleep 10
end
end
Custom Metric Collection for application-specific monitoring:
class DatabaseMetricsCollector
def initialize(database_url)
@conn = PG.connect(database_url)
@metrics = {}
end
def collect_all
collect_connection_metrics
collect_query_performance_metrics
collect_replication_metrics
collect_lock_metrics
@metrics
end
private
def collect_connection_metrics
stats = @conn.exec(<<-SQL).first
SELECT
count(*) as total_connections,
count(*) FILTER (WHERE state = 'active') as active_connections,
count(*) FILTER (WHERE wait_event_type IS NOT NULL) as waiting_connections
FROM pg_stat_activity
SQL
@metrics[:connections] = stats.transform_values(&:to_i)
end
def collect_query_performance_metrics
cache_stats = @conn.exec(<<-SQL).first
SELECT
sum(blks_hit) as buffer_hits,
sum(blks_read) as disk_reads,
round(sum(blks_hit)::numeric / nullif(sum(blks_hit) + sum(blks_read), 0) * 100, 2) as cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database()
SQL
@metrics[:cache] = {
hit_ratio: cache_stats['cache_hit_ratio'].to_f,
buffer_hits: cache_stats['buffer_hits'].to_i,
disk_reads: cache_stats['disk_reads'].to_i
}
end
def collect_replication_metrics
replication = @conn.exec(<<-SQL)
SELECT
client_addr,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) as send_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
FROM pg_stat_replication
SQL
@metrics[:replication] = replication.map do |row|
{
replica: row['client_addr'],
state: row['state'],
sync_state: row['sync_state'],
send_lag_bytes: row['send_lag_bytes'].to_i,
replay_lag_bytes: row['replay_lag_bytes'].to_i
}
end
end
def collect_lock_metrics
locks = @conn.exec(<<-SQL).first
SELECT
count(*) FILTER (WHERE mode LIKE '%ExclusiveLock') as exclusive_locks,
count(*) FILTER (WHERE NOT granted) as waiting_locks
FROM pg_locks
SQL
@metrics[:locks] = locks.transform_values(&:to_i)
end
end
Implementation Approaches
Database monitoring implementations vary based on deployment architecture, database technology, and organizational requirements.
Embedded Monitoring integrates collection logic directly within the application process. This approach provides immediate access to application context and requires no external services, but monitoring code executes within application memory and CPU budgets. Embedded monitoring suits small to medium deployments where monitoring overhead remains manageable.
# Embedded monitoring with periodic reporting
class EmbeddedMonitor
def initialize(report_interval: 60)
@report_interval = report_interval
@query_times = []
@mutex = Mutex.new
end
def record_query(duration, query_type)
@mutex.synchronize do
@query_times << {
duration: duration,
type: query_type,
timestamp: Time.now
}
end
end
def start_reporting
Thread.new do
loop do
sleep @report_interval
report_metrics
end
end
end
private
def report_metrics
@mutex.synchronize do
return if @query_times.empty?
durations = @query_times.map { |q| q[:duration] }
metrics = {
count: durations.size,
mean: durations.sum / durations.size,
p95: percentile(durations, 95),
p99: percentile(durations, 99),
max: durations.max
}
# Send to external system or log
Rails.logger.info("Database Metrics: #{metrics}")
@query_times.clear
end
end
def percentile(values, percentile)
sorted = values.sort
index = (percentile / 100.0 * sorted.length).ceil - 1
sorted[index]
end
end
Sidecar Monitoring deploys dedicated monitoring processes alongside applications. Sidecars collect metrics through inter-process communication or by polling the database independently. This approach isolates monitoring overhead from application processes and supports multiple applications monitoring the same database, though it requires additional deployment complexity.
Agent-Based Monitoring installs monitoring agents on database servers that export metrics for collection by centralized systems. Database agents have privileged access to server-level metrics like disk I/O, memory usage, and internal database statistics unavailable through standard SQL queries. Agent-based approaches excel at infrastructure monitoring but require server access and administrative permissions.
Query Log Analysis examines database query logs to extract performance patterns. Log analysis operates without active database queries, analyzing historical execution patterns. This approach provides comprehensive query coverage but introduces latency between query execution and metric availability.
# Log parser for PostgreSQL CSV logs
class PostgresLogParser
def initialize(log_file_path)
@log_file_path = log_file_path
end
def parse_slow_queries(threshold_ms = 1000)
slow_queries = []
File.foreach(@log_file_path) do |line|
next unless line.include?('duration:')
if match = line.match(/duration: ([\d.]+) ms.*statement: (.+)$/)
duration = match[1].to_f
query = match[2]
if duration >= threshold_ms
slow_queries << {
duration: duration,
query: normalize_query(query),
timestamp: extract_timestamp(line)
}
end
end
end
aggregate_queries(slow_queries)
end
private
def normalize_query(query)
# Replace specific values with placeholders for aggregation
query.gsub(/\d+/, 'N')
.gsub(/'[^']+'/, "'?'")
.gsub(/\s+/, ' ')
.strip
end
def aggregate_queries(queries)
queries.group_by { |q| q[:query] }.map do |normalized_query, instances|
{
query: normalized_query,
count: instances.size,
total_duration: instances.sum { |i| i[:duration] },
avg_duration: instances.sum { |i| i[:duration] } / instances.size,
max_duration: instances.map { |i| i[:duration] }.max
}
end.sort_by { |q| -q[:total_duration] }
end
end
Sampling-Based Monitoring collects metrics from a subset of operations rather than every transaction. Sampling reduces monitoring overhead while maintaining statistical validity for common patterns. A 1% sample rate captures sufficient data for trend analysis while minimizing performance impact, though rare events may go undetected.
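A probabilistic sampler can be sketched in a few lines. The class name is illustrative, and the generator is seeded here only to make the demonstration reproducible; production code would use the default generator:

```ruby
# Record roughly sample_rate of operations by flipping a biased coin.
class MetricSampler
  def initialize(sample_rate: 0.01, rng: Random.new)
    @sample_rate = sample_rate
    @rng = rng
  end

  # Decide whether this operation's metrics should be recorded
  def sample?
    @rng.rand < @sample_rate
  end
end

sampler = MetricSampler.new(sample_rate: 0.01, rng: Random.new(42))
sampled = (1..100_000).count { sampler.sample? }
puts "sampled #{sampled} of 100000 operations"  # close to 1000
```

When a sampled metric is aggregated into totals, counts must be scaled back up by 1/sample_rate; latency percentiles, by contrast, can be computed directly from the sample.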
Continuous Aggregation pre-computes summary statistics within the database using materialized views or background processes. Applications query pre-aggregated metrics rather than computing them on-demand, reducing query overhead at the cost of slightly stale data.
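A PostgreSQL sketch of this pattern uses a materialized view refreshed on a schedule. The view and column names are illustrative, and the source query assumes the pg_stat_statements extension is enabled:

```ruby
# DDL for a pre-aggregated hourly summary of query statistics.
CREATE_SUMMARY_VIEW = <<-SQL
  CREATE MATERIALIZED VIEW IF NOT EXISTS query_stats_hourly AS
  SELECT date_trunc('hour', now()) AS bucket,
         sum(calls) AS total_calls,
         sum(total_exec_time) AS total_exec_time_ms
  FROM pg_stat_statements
SQL

# Re-computing the view is the "background process" part of the pattern.
REFRESH_SUMMARY_VIEW = <<-SQL
  REFRESH MATERIALIZED VIEW query_stats_hourly
SQL

# A scheduled job would run the refresh periodically, e.g.:
#   conn.exec(REFRESH_SUMMARY_VIEW)
# and dashboards query query_stats_hourly instead of raw statistics views.
```

Dashboards reading the view see data that is at most one refresh interval stale, which is the trade-off the paragraph above describes.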
Tools & Ecosystem
The Ruby ecosystem provides multiple tools and gems for database monitoring, each addressing different aspects of observability.
pg_query parses PostgreSQL queries for analysis:
require 'pg_query'
class QueryAnalyzer
def analyze_query(sql)
parsed = PgQuery.parse(sql)
{
tables: extract_tables(parsed),
query_type: determine_query_type(parsed)
}
end
private
def extract_tables(parsed)
parsed.tables
end
def determine_query_type(parsed)
# Each statement is wrapped in a PgQuery::Node; the protobuf oneof
# accessor returns the statement kind as a symbol
case parsed.tree.stmts.first.stmt.node
when :select_stmt
'SELECT'
when :insert_stmt
'INSERT'
when :update_stmt
'UPDATE'
when :delete_stmt
'DELETE'
else
'OTHER'
end
end
end
marginalia annotates queries with application context for monitoring:
# Gemfile
gem 'marginalia'
# config/initializers/marginalia.rb
Marginalia::Comment.components = [:application, :controller, :action, :job]
# Queries now include comments:
# SELECT * FROM users /* application:my_app,controller:users,action:index */
Query annotations enable correlating database load with specific application endpoints or background jobs, simplifying performance troubleshooting.
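The correlation works in reverse, too: the annotation can be parsed back out of a query text (for example, from pg_stat_activity's query column) to attribute load to an endpoint. A sketch, with a regex matching the comment format shown above:

```ruby
# Extract marginalia-style annotations from an annotated SQL string.
MARGINALIA_COMMENT = %r{/\*\s*([^*]+?)\s*\*/}

def parse_annotation(sql)
  match = sql.match(MARGINALIA_COMMENT)
  return {} unless match
  # "application:my_app,controller:users" -> { "application" => "my_app", ... }
  match[1].split(',').to_h { |pair| pair.split(':', 2) }
end

sql = "SELECT * FROM users /* application:my_app,controller:users,action:index */"
puts parse_annotation(sql)
# {"application"=>"my_app", "controller"=>"users", "action"=>"index"}
```

Grouping pg_stat_activity or slow-query-log entries by the parsed `controller`/`action` pair turns raw database load into a per-endpoint breakdown.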
ruby-pg-extras provides convenient access to PostgreSQL statistics:
require 'ruby_pg_extras'
# Cache hit ratios
RubyPgExtras.cache_hit
# Long-running queries
RubyPgExtras.long_running_queries
# Unused indexes
RubyPgExtras.unused_indexes
# Table sizes
RubyPgExtras.table_size
database_cleaner ensures clean test environments but also enables monitoring test database behavior:
require 'database_cleaner'
DatabaseCleaner.strategy = :transaction
# Monitor test database overhead
test_start = Time.now
DatabaseCleaner.cleaning do
# Run tests
end
test_duration = Time.now - test_start
puts "Test suite database time: #{test_duration}s"
Rack::MiniProfiler profiles database queries in development and staging:
# Gemfile
gem 'rack-mini-profiler'
# config/environments/development.rb
config.middleware.use Rack::MiniProfiler
# Shows timing breakdown for all queries on each request
PrometheusExporter exports application and database metrics:
require 'prometheus_exporter/client'
class DatabasePrometheusExporter
def initialize
@client = PrometheusExporter::Client.default
end
def export_pool_metrics
pool = ActiveRecord::Base.connection_pool
@client.send_json(
type: 'database_pool',
metric_labels: { pool: 'primary' },
size: pool.size,
in_use: pool.connections.count(&:in_use?),
waiting: pool.num_waiting_in_queue
)
end
def export_query_metrics(duration, query_type)
@client.send_json(
type: 'database_query',
metric_labels: { type: query_type },
duration: duration
)
end
end
Scout APM and New Relic provide commercial monitoring solutions with Ruby gems:
# Gemfile
gem 'scout_apm'
# config/scout_apm.yml
production:
  monitor: true
  key: <%= ENV['SCOUT_KEY'] %>
  name: MyApp
These services offer comprehensive monitoring including query analysis, endpoint profiling, and alerting without requiring infrastructure setup.
Integration & Interoperability
Database monitoring integrates with broader observability systems through standardized metrics, logging, and tracing protocols.
Metrics Export typically uses formats like Prometheus, StatsD, or CloudWatch:
require 'statsd-instrument'
class DatabaseMetricsExporter
def initialize
# statsd-instrument 3.x configures its sink via STATSD_ADDR and
# STATSD_IMPLEMENTATION environment variables; the explicit backend
# assignment below is the 2.x API
StatsD.backend = StatsD::Instrument::Backends::UDPBackend.new(
'localhost:8125',
:statsd
)
end
def export_connection_metrics
pool = ActiveRecord::Base.connection_pool
StatsD.gauge('db.pool.size', pool.size)
StatsD.gauge('db.pool.in_use', pool.connections.count(&:in_use?))
StatsD.gauge('db.pool.available', pool.connections.count { |c| !c.in_use? })
StatsD.gauge('db.pool.waiting', pool.num_waiting_in_queue)
end
def export_query_metrics(duration, query_name)
StatsD.histogram('db.query.duration', duration, tags: ["query:#{query_name}"])
StatsD.increment('db.query.count', tags: ["query:#{query_name}"])
end
end
# Integration with ActiveSupport::Notifications
ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
event = ActiveSupport::Notifications::Event.new(*args)
exporter = DatabaseMetricsExporter.new
exporter.export_query_metrics(
event.duration,
event.payload[:name] || 'unknown'
)
end
Structured Logging provides query-level visibility:
class DatabaseLogger
def initialize(logger)
@logger = logger
end
def log_query(sql, duration, rows_affected)
@logger.info(
event: 'database_query',
sql: sanitize_sql(sql),
duration_ms: duration,
rows: rows_affected,
timestamp: Time.now.iso8601
)
end
private
def sanitize_sql(sql)
# Remove sensitive data from logged queries
sql.gsub(/password\s*=\s*'[^']+'/, "password='[REDACTED]'")
.gsub(/\d{16}/, '[CARD_NUMBER]')
end
end
# Integration point
ActiveSupport::Notifications.subscribe('sql.active_record') do |name, start, finish, id, payload|
duration = (finish - start) * 1000
logger = DatabaseLogger.new(Rails.logger)
logger.log_query(
payload[:sql],
duration,
payload[:row_count] || 0
)
end
Distributed Tracing correlates database operations across service boundaries:
require 'opentelemetry/sdk'
require 'opentelemetry/instrumentation/active_record'
OpenTelemetry::SDK.configure do |c|
c.service_name = 'my_application'
c.use 'OpenTelemetry::Instrumentation::ActiveRecord'
end
# Database queries automatically appear in distributed traces
# with parent request context
Health Check Endpoints expose database status for load balancers and orchestration:
# config/routes.rb
get '/health/database', to: 'health#database'
# app/controllers/health_controller.rb
class HealthController < ApplicationController
def database
checks = {
connection: check_connection,
pool: check_pool_health,
replication: check_replication_lag
}
status = checks.values.all? { |c| c[:healthy] } ? 200 : 503
render json: {
status: status == 200 ? 'healthy' : 'unhealthy',
checks: checks,
timestamp: Time.now.iso8601
}, status: status
end
private
def check_connection
ActiveRecord::Base.connection.execute('SELECT 1')
{ healthy: true, message: 'Connection successful' }
rescue => e
{ healthy: false, message: e.message }
end
def check_pool_health
pool = ActiveRecord::Base.connection_pool
utilization = (pool.connections.count(&:in_use?).to_f / pool.size * 100).round(2)
{
healthy: utilization < 90,
utilization: utilization,
message: "Pool #{utilization}% utilized"
}
end
def check_replication_lag
lag = ActiveRecord::Base.connection.execute(<<-SQL).first
SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp())) as lag
SQL
# pg_last_xact_replay_timestamp() is NULL on a primary, so lag_seconds reports 0.0 there
lag_seconds = lag['lag'].to_f
{
healthy: lag_seconds < 30,
lag_seconds: lag_seconds,
message: "Replication lag: #{lag_seconds}s"
}
end
end
Alert Manager Integration sends notifications based on metric thresholds:
class DatabaseAlertManager
def initialize(alerting_service)
@alerting_service = alerting_service
end
def check_and_alert
metrics = collect_metrics
evaluate_alert_rules(metrics).each do |alert|
@alerting_service.send_alert(alert)
end
end
private
def evaluate_alert_rules(metrics)
alerts = []
if metrics[:pool_utilization] > 90
alerts << {
severity: 'critical',
title: 'Database Pool Exhaustion',
description: "Connection pool at #{metrics[:pool_utilization]}%",
timestamp: Time.now
}
end
if metrics[:replication_lag] > 300
alerts << {
severity: 'warning',
title: 'High Replication Lag',
description: "Replica lagging by #{metrics[:replication_lag]} seconds",
timestamp: Time.now
}
end
alerts
end
end
Performance Considerations
Database monitoring introduces overhead that must be balanced against observability benefits. Each monitoring approach carries distinct performance implications.
Query Overhead from monitoring queries competes with application queries for database resources. Statistics queries that aggregate across large tables or indexes can consume significant CPU and I/O. Monitoring queries should use indexes, avoid full table scans, and execute during periods of lower load when possible.
# Inefficient monitoring query - full table scan
def count_users_inefficient
User.count # Can be slow on large tables
end
# Efficient monitoring query - uses statistics
def count_users_efficient
ActiveRecord::Base.connection.execute(<<-SQL).first
SELECT reltuples::bigint as estimate
FROM pg_class
WHERE relname = 'users'
SQL
end
Collection Frequency directly impacts monitoring overhead. Collecting metrics every second provides high resolution but generates 60x more overhead than collecting every minute. High-frequency collection suits detecting transient issues while lower frequencies suffice for trend analysis.
class AdaptiveMonitoring
def initialize
@baseline_response_time = nil
@collection_interval = 60 # Start with 60 second intervals
end
def monitor
loop do
metrics = collect_metrics
# Increase frequency if performance degrades
if performance_degraded?(metrics)
@collection_interval = 10
else
@collection_interval = 60
end
sleep @collection_interval
end
end
private
def performance_degraded?(metrics)
return false unless @baseline_response_time
metrics[:response_time] > @baseline_response_time * 1.5
end
end
Memory Usage from metric storage grows with collection frequency and retention duration. Storing second-level metrics for a week requires substantially more memory than minute-level metrics. Applications should aggregate historical data progressively, maintaining full resolution for recent data while summarizing older metrics.
Connection Pool Impact occurs when monitoring consumes connections from the application pool. Dedicated monitoring connections prevent this, but require larger connection pools. The tradeoff between pool size and monitoring isolation depends on connection limits and application concurrency needs.
# Separate connection pool for monitoring
class MonitoringDatabase < ActiveRecord::Base
establish_connection(
adapter: 'postgresql',
database: ENV['DATABASE_NAME'],
pool: 2, # Small dedicated pool
checkout_timeout: 5
)
end
# Monitoring queries use separate pool
def monitor_database
MonitoringDatabase.connection.execute(<<-SQL)
SELECT * FROM pg_stat_activity
SQL
end
Sampling Strategies reduce overhead by monitoring a subset of operations. A 1% sample rate decreases monitoring overhead by 99% while maintaining statistical validity for common patterns. However, sampling may miss rare but critical events like infrequent slow queries.
Buffering and Batching defer metric transmission to reduce network overhead and database load. Collecting metrics locally and transmitting in batches reduces the number of monitoring transactions at the cost of slightly delayed visibility.
class BatchedMetricsCollector
def initialize(batch_size: 100, flush_interval: 30)
@batch_size = batch_size
@flush_interval = flush_interval
@buffer = []
@mutex = Mutex.new
start_flush_timer
end
def record_metric(metric)
@mutex.synchronize do
@buffer << metric
flush if @buffer.size >= @batch_size
end
end
private
def start_flush_timer
Thread.new do
loop do
sleep @flush_interval
@mutex.synchronize { flush }
end
end
end
# flush assumes the caller holds @mutex; Ruby's Mutex is not reentrant,
# so flush must not lock it again
def flush
return if @buffer.empty?
metrics_to_send = @buffer.dup
@buffer.clear
# Send batch to monitoring system
send_to_monitoring_system(metrics_to_send)
end
end
Index Impact on monitoring queries determines whether they execute efficiently or cause table scans. Monitoring queries that join multiple system tables or filter by non-indexed columns can slow database operations. Creating indexes specifically for monitoring queries may be necessary in high-scale environments.
Reference
Core Monitoring Metrics
| Metric | Description | Target Range | Alert Threshold |
|---|---|---|---|
| Connection Count | Active database connections | Varies by workload | >80% of max |
| Pool Utilization | Percentage of pool in use | <70% typical | >90% |
| Query Duration P95 | 95th percentile query time | <100ms | >1000ms |
| Query Duration P99 | 99th percentile query time | <500ms | >5000ms |
| Cache Hit Ratio | Buffer cache hit percentage | >95% | <90% |
| Locks Waiting | Queries waiting for locks | 0 | >5 |
| Replication Lag | Seconds behind primary | <1s | >30s |
| Checkpoint Write Time | Time spent writing checkpoints | <30s | >60s |
| Deadlocks | Deadlock occurrences | 0 | >0 |
| Rollback Rate | Percentage of rolled back transactions | <1% | >5% |
PostgreSQL System Views
| View | Purpose | Key Columns |
|---|---|---|
| pg_stat_activity | Current database activity | state, query, state_change |
| pg_stat_statements | Query execution statistics | query, calls, total_exec_time |
| pg_stat_database | Database-wide statistics | blks_hit, blks_read, conflicts |
| pg_stat_replication | Replication status | state, sync_state, replay_lag |
| pg_locks | Current lock status | mode, granted, relation |
| pg_stat_user_tables | Table access patterns | seq_scan, idx_scan, n_tup_upd |
| pg_stat_user_indexes | Index usage statistics | idx_scan, idx_tup_read |
| pg_statio_user_tables | Table I/O statistics | heap_blks_read, heap_blks_hit |
Common Monitoring Queries
| Query Type | Use Case | Execution Frequency |
|---|---|---|
| Connection stats | Track active connections | Every 10-30 seconds |
| Slow query detection | Identify performance issues | Every 1-5 minutes |
| Cache hit ratio | Assess buffer efficiency | Every 1 minute |
| Lock contention | Detect blocking queries | Every 30 seconds |
| Replication lag | Verify replica health | Every 10 seconds |
| Table bloat | Plan maintenance | Daily |
| Index usage | Optimize indexes | Weekly |
| Vacuum progress | Monitor maintenance | During vacuum |
Monitoring Collection Methods
| Method | Overhead | Latency | Use Case |
|---|---|---|---|
| Direct SQL queries | Low-Medium | Immediate | Real-time monitoring |
| Query log analysis | Low | Minutes-Hours | Historical analysis |
| Statistics views | Very Low | Immediate | Lightweight monitoring |
| External agents | Medium | Seconds | Infrastructure monitoring |
| Application instrumentation | Low | Immediate | Request correlation |
| Sampling | Very Low | Immediate | High-volume systems |
Alert Severity Levels
| Level | Criteria | Response Time | Example |
|---|---|---|---|
| Critical | Service degraded or down | Immediate | Connection failure |
| Warning | Performance degradation | 15 minutes | High query latency |
| Info | Threshold approaching | 1 hour | Pool at 75% |
| Debug | Detailed diagnostics | Non-urgent | Slow query logged |
Ruby Monitoring Gems
| Gem | Primary Function | Best For |
|---|---|---|
| ruby-pg-extras | PostgreSQL statistics helpers | Quick diagnostics |
| marginalia | Query annotation | Request correlation |
| rack-mini-profiler | Development profiling | Local optimization |
| prometheus_exporter | Metrics export | Production monitoring |
| scout_apm | Commercial APM | Full-stack observability |
| newrelic_rpm | Commercial APM | Enterprise monitoring |
| pg_query | Query parsing | Query analysis |
| database_cleaner | Test isolation | Test environment |
Performance Impact Guidelines
| Monitoring Approach | CPU Impact | Memory Impact | Network Impact |
|---|---|---|---|
| Passive observation | <1% | Low | None |
| Active health checks | 1-3% | Low | Low |
| Full query logging | 5-15% | Medium-High | Medium |
| High-frequency polling | 3-10% | Medium | Low-Medium |
| Distributed tracing | 2-5% | Medium | Medium |
| Real-time streaming | 5-10% | High | High |