Overview
Network monitoring involves the continuous observation and measurement of network infrastructure to ensure availability, performance, and security. The practice encompasses tracking bandwidth usage, latency, packet loss, device status, and security events across routers, switches, servers, and applications.
Modern applications depend on reliable network connectivity. A web application may interact with databases, caching layers, external APIs, and content delivery networks. Network monitoring provides visibility into these connections, identifying bottlenecks, failures, and security threats before they impact users.
Network monitoring operates at multiple layers of the network stack. At the physical layer, monitoring tracks device availability and hardware health. At the transport layer, it measures TCP connection states, retransmissions, and latency. At the application layer, it analyzes HTTP response times, API endpoint performance, and protocol-specific metrics.
The fundamental monitoring workflow involves data collection through agents or probes, metric aggregation in a central system, threshold-based alerting, and visualization through dashboards. Active monitoring sends synthetic requests to measure availability and response times. Passive monitoring captures actual traffic patterns without generating additional load.
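require 'socket'

# Simple TCP reachability check. This connects to port 80 rather than sending
# an ICMP echo, so it reports whether the host accepts connections on that port.
# The connect_timeout option requires Ruby 3.0 or later.
def check_host_reachability(host, timeout = 5)
  start_time = Time.now
  TCPSocket.new(host, 80, connect_timeout: timeout).close
  latency = ((Time.now - start_time) * 1000).round(2)
  { status: :reachable, latency_ms: latency }
rescue SystemCallError, SocketError
  # Covers timeouts, refused connections, unreachable hosts, and DNS failures
  { status: :unreachable, latency_ms: nil }
end

check_host_reachability('example.com')
# => e.g. { status: :reachable, latency_ms: 45.32 }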
Network monitoring distinguishes between synthetic monitoring and real user monitoring. Synthetic monitoring executes scripted transactions from controlled locations, providing consistent baselines. Real user monitoring captures actual user experience by instrumenting application traffic.
Key Principles
Network monitoring operates on several foundational principles that guide effective implementation and interpretation of results.
Continuous Collection: Network conditions change constantly. Sporadic checks miss transient issues that impact user experience. Monitoring systems collect data at regular intervals, typically ranging from one second for critical metrics to several minutes for less volatile measurements. The collection interval balances data granularity against storage and processing overhead.
Multi-Layer Visibility: Networks operate across multiple protocol layers. Monitoring a single layer provides incomplete information. A web request involves DNS resolution, TCP connection establishment, TLS negotiation, HTTP exchange, and application processing. Effective monitoring instruments each layer to pinpoint failure locations.
Baseline Establishment: Raw metrics require context. A 200ms API response time means nothing without knowing typical performance. Monitoring systems establish baselines through statistical analysis of historical data, then detect deviations that indicate problems. Baselines account for daily and weekly patterns in traffic.
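A minimal sketch of baseline-based deviation detection, assuming that a plain mean and standard deviation over recent history are an adequate baseline. The Baseline class and the default threshold of three standard deviations are illustrative choices rather than part of any library, and the sketch does not model the daily and weekly patterns mentioned above.

```ruby
# Hypothetical helper: flags samples that deviate from a historical baseline.
class Baseline
  attr_reader :mean, :stddev

  def initialize(history)
    raise ArgumentError, 'history needs at least two samples' if history.size < 2

    @mean = history.sum.to_f / history.size
    # Sample variance (divides by n - 1)
    variance = history.sum { |v| (v - @mean)**2 } / (history.size - 1)
    @stddev = Math.sqrt(variance)
  end

  # Number of standard deviations between the sample and the baseline mean.
  # Returns 0.0 when the history has no variance at all.
  def z_score(sample)
    return 0.0 if @stddev.zero?

    (sample - @mean) / @stddev
  end

  def anomalous?(sample, threshold: 3.0)
    z_score(sample).abs > threshold
  end
end

baseline = Baseline.new([180, 195, 200, 205, 220]) # response times in ms
baseline.anomalous?(210) # => false
baseline.anomalous?(450) # => true
```

With this history the mean is 200 ms, so a 210 ms response sits well inside the normal range while 450 ms is flagged.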
Distributed Perspective: Monitoring from a single location reveals only that location's network path. Production systems serve users from multiple geographic regions through varied network paths. Distributed monitoring deploys probes across locations to capture regional differences in performance and availability.
Active and Passive Monitoring: Active monitoring sends synthetic requests to measure availability and response times under controlled conditions. Passive monitoring analyzes actual traffic without generating load. Active monitoring detects failures within one check interval, even when no users are active, but may miss issues affecting only specific user segments. Passive monitoring captures real user experience but requires traffic to detect problems.
Threshold-Based Alerting: Continuous monitoring generates massive data volumes. Engineers cannot manually inspect every metric. Alerting systems evaluate metrics against thresholds, triggering notifications when values exceed acceptable bounds. Effective thresholds balance sensitivity against false positive rates.
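Threshold evaluation can be sketched as follows. The THRESHOLDS table, the evaluate method, and the ConsecutiveBreachAlert class are hypothetical names, and the numbers are placeholders. Requiring several consecutive breaches is one common way to trade detection speed for fewer false positives.

```ruby
# Hypothetical threshold table; the values are placeholders, not recommendations.
THRESHOLDS = {
  error_rate_percent: { warning: 1.0, critical: 5.0 },
  p95_latency_ms: { warning: 500, critical: 1000 }
}.freeze

# Returns :ok, :warning, or :critical for a metric where higher is worse.
def evaluate(metric, value, thresholds = THRESHOLDS)
  limits = thresholds.fetch(metric)
  return :critical if value >= limits[:critical]
  return :warning if value >= limits[:warning]

  :ok
end

# Fires only after `required` consecutive non-ok evaluations, which damps
# flapping at the cost of slower detection.
class ConsecutiveBreachAlert
  def initialize(required: 3)
    @required = required
    @streak = 0
  end

  # Returns true when the alert should fire for this observation.
  def observe(severity)
    @streak = severity == :ok ? 0 : @streak + 1
    @streak >= @required
  end
end

evaluate(:error_rate_percent, 2.0) # => :warning
```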
Protocol-Specific Metrics: Different protocols require different monitoring approaches. HTTP monitoring tracks status codes, response times, and content validity. DNS monitoring measures query resolution time and record accuracy. SMTP monitoring verifies mail server connectivity and delivery times. Protocol-specific monitoring provides actionable insights for troubleshooting.
The data collection mechanism varies by monitoring type. SNMP polling queries devices for management information. Flow-based monitoring analyzes NetFlow or sFlow data exported by network devices. Packet capture examines individual packets for deep protocol analysis. Log aggregation collects and parses application and system logs.
require 'snmp'

# SNMP polling for interface metrics (interface index 1)
def collect_interface_stats(host, community = 'public')
  SNMP::Manager.open(host: host, community: community) do |manager|
    response = manager.get([
      '1.3.6.1.2.1.2.2.1.10.1', # ifInOctets
      '1.3.6.1.2.1.2.2.1.16.1', # ifOutOctets
      '1.3.6.1.2.1.2.2.1.8.1'   # ifOperStatus
    ])
    {
      bytes_in: response.varbind_list[0].value.to_i,
      bytes_out: response.varbind_list[1].value.to_i,
      status: response.varbind_list[2].value.to_i == 1 ? :up : :down
    }
  end
end
Metric aggregation reduces data volume while preserving meaningful information. Raw per-second metrics may aggregate into one-minute averages, with minimum, maximum, and percentile values retained for variance analysis. Time-series databases optimize storage and retrieval of timestamped metric data.
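The aggregation step described above might look like the following sketch. It assumes samples arrive as [unix_timestamp, value] pairs and uses a nearest-rank percentile; the percentile and aggregate method names are illustrative.

```ruby
# Nearest-rank percentile over a non-empty array of numbers.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Groups [unix_timestamp, value] samples into fixed windows and summarizes
# each window, keeping min, max, and p95 so variance is not lost in the average.
def aggregate(samples, window_seconds: 60)
  samples
    .group_by { |timestamp, _| timestamp - (timestamp % window_seconds) }
    .sort_by(&:first)
    .map do |window_start, rows|
      values = rows.map { |_, value| value }
      {
        window_start: window_start,
        count: values.size,
        min: values.min,
        max: values.max,
        avg: (values.sum.to_f / values.size).round(2),
        p95: percentile(values, 95)
      }
    end
end

aggregate([[0, 10], [30, 20], [59, 30], [60, 100]])
# => two windows: the first covers timestamps 0-59, the second starts at 60
```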
Ruby Implementation
Ruby provides multiple approaches for implementing network monitoring, from low-level socket operations to high-level HTTP clients and specialized monitoring gems.
Socket-Level Monitoring: The Socket library enables direct TCP and UDP communication for implementing custom protocols or performing low-level connectivity checks.
require 'socket'
require 'timeout'

class NetworkMonitor
  def check_tcp_port(host, port, timeout = 5)
    start = Time.now
    Timeout.timeout(timeout) do
      TCPSocket.new(host, port).close
    end
    { available: true, response_time: ((Time.now - start) * 1000).round(2) }
  rescue Timeout::Error
    { available: false, error: 'Connection timeout' }
  rescue Errno::ECONNREFUSED
    { available: false, error: 'Connection refused' }
  rescue Errno::EHOSTUNREACH
    { available: false, error: 'Host unreachable' }
  rescue SocketError => e
    { available: false, error: "DNS or network error: #{e.message}" }
  end

  def check_udp_service(host, port, payload, expected_response)
    socket = UDPSocket.new
    Timeout.timeout(5) do
      # Sending inside the guarded section means send failures are reported too
      socket.send(payload, 0, host, port)
      response, = socket.recvfrom(1024)
      { success: response.include?(expected_response) }
    end
  rescue Timeout::Error
    { success: false, error: 'No response' }
  rescue SocketError, SystemCallError => e
    { success: false, error: e.message }
  ensure
    socket&.close
  end
end

monitor = NetworkMonitor.new
monitor.check_tcp_port('google.com', 443)
# => e.g. { available: true, response_time: 23.45 }
HTTP Endpoint Monitoring: The Net::HTTP library and modern HTTP clients like HTTP.rb provide comprehensive request/response monitoring capabilities.
require 'net/http'
require 'uri'
require 'json'

class HTTPMonitor
  def check_endpoint(url, expected_status: 200, timeout: 10)
    uri = URI(url)
    start = Time.now
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == 'https',
                    open_timeout: timeout,
                    read_timeout: timeout) do |http|
      response = http.request(Net::HTTP::Get.new(uri))
      duration = ((Time.now - start) * 1000).round(2)
      body = response.body.to_s
      {
        status_code: response.code.to_i,
        response_time: duration,
        success: response.code.to_i == expected_status,
        body: body, # kept so callers can validate the payload
        body_size: body.bytesize,
        headers: response.to_hash
      }
    end
  rescue Net::OpenTimeout
    { success: false, error: 'Connection timeout' }
  rescue Net::ReadTimeout
    { success: false, error: 'Read timeout' }
  rescue StandardError => e
    { success: false, error: e.message }
  end

  def check_api_endpoint(url, expected_keys: [])
    result = check_endpoint(url)
    return result unless result[:success]

    begin
      json = JSON.parse(result.delete(:body))
      # A non-object payload (for example an array) has none of the expected keys
      missing_keys = json.is_a?(Hash) ? expected_keys - json.keys : expected_keys
      result[:valid_response] = missing_keys.empty?
      result[:missing_keys] = missing_keys unless missing_keys.empty?
    rescue JSON::ParserError
      result[:valid_response] = false
      result[:error] = 'Invalid JSON response'
    end
    result
  end
end

monitor = HTTPMonitor.new
monitor.check_api_endpoint(
  'https://api.example.com/health',
  expected_keys: ['status', 'version']
)
DNS Monitoring: The Resolv library enables DNS query monitoring to track resolution times and validate DNS configurations.
require 'resolv'

class DNSMonitor
  RECORD_CLASSES = {
    A: Resolv::DNS::Resource::IN::A,
    MX: Resolv::DNS::Resource::IN::MX,
    TXT: Resolv::DNS::Resource::IN::TXT
  }.freeze

  def check_resolution(hostname, expected_ip: nil)
    resolver = Resolv::DNS.new(nameserver: ['8.8.8.8'])
    start = Time.now
    addresses = resolver.getaddresses(hostname).map(&:to_s)
    resolution_time = ((Time.now - start) * 1000).round(2)

    # getaddresses returns an empty array instead of raising when nothing resolves
    return { resolved: false, error: 'DNS resolution failed' } if addresses.empty?

    result = {
      resolved: true,
      resolution_time: resolution_time,
      addresses: addresses
    }
    result[:matches_expected] = addresses.include?(expected_ip) if expected_ip
    result
  rescue Resolv::ResolvError
    { resolved: false, error: 'DNS resolution failed' }
  ensure
    resolver&.close
  end

  def check_dns_records(hostname, record_type: :A)
    record_class = RECORD_CLASSES.fetch(record_type) do
      raise ArgumentError, "Unsupported record type: #{record_type}"
    end
    resolver = Resolv::DNS.new
    resources = resolver.getresources(hostname, record_class)
    {
      record_type: record_type,
      count: resources.size,
      records: resources.map { |r| format_record(r) }
    }
  ensure
    resolver&.close
  end

  private

  # Each record type exposes its data through different accessors
  def format_record(record)
    case record
    when Resolv::DNS::Resource::IN::A then record.address.to_s
    when Resolv::DNS::Resource::IN::MX then "#{record.preference} #{record.exchange}"
    when Resolv::DNS::Resource::IN::TXT then record.strings.join
    end
  end
end
Ping Monitoring: The net-ping gem provides ICMP, TCP, and UDP ping implementations for host reachability checks.
require 'net/ping'

class PingMonitor
  # ICMP pings use raw sockets, which typically require root privileges
  def icmp_ping(host, count: 4)
    pinger = Net::Ping::ICMP.new(host)
    results = []
    count.times do
      result = pinger.ping
      results << {
        success: result,
        duration: pinger.duration ? (pinger.duration * 1000).round(2) : nil
      }
      sleep 1
    end

    successful = results.count { |r| r[:success] }
    durations = results.select { |r| r[:duration] }.map { |r| r[:duration] }
    {
      packets_sent: count,
      packets_received: successful,
      packet_loss: ((count - successful) / count.to_f * 100).round(2),
      avg_duration: durations.empty? ? nil : (durations.sum / durations.size).round(2),
      min_duration: durations.min,
      max_duration: durations.max
    }
  end

  def tcp_ping(host, port, timeout: 5)
    pinger = Net::Ping::TCP.new(host, port, timeout)
    result = pinger.ping
    {
      reachable: result,
      port: port,
      duration: result ? (pinger.duration * 1000).round(2) : nil
    }
  end
end

monitor = PingMonitor.new
monitor.icmp_ping('8.8.8.8')
# => e.g. { packets_sent: 4, packets_received: 4, packet_loss: 0.0, avg_duration: 12.5, ... }
SNMP Monitoring: The snmp gem implements SNMP v1 and v2c for querying network device metrics. It does not support SNMPv3.
require 'snmp'

class SNMPMonitor
  OID_SYSTEM_UPTIME = '1.3.6.1.2.1.1.3.0'
  OID_SYSTEM_DESCR = '1.3.6.1.2.1.1.1.0'
  OID_IF_TABLE = '1.3.6.1.2.1.2.2.1'

  def initialize(host, community: 'public', version: :SNMPv2c)
    @host = host
    @community = community
    @version = version
  end

  def get_system_info
    open_manager do |manager|
      response = manager.get([OID_SYSTEM_UPTIME, OID_SYSTEM_DESCR])
      {
        uptime_ticks: response.varbind_list[0].value.to_i,
        # sysUpTime is reported in hundredths of a second
        uptime_seconds: response.varbind_list[0].value.to_i / 100,
        description: response.varbind_list[1].value.to_s
      }
    end
  end

  def get_interface_stats(interface_index)
    oids = {
      in_octets: "#{OID_IF_TABLE}.10.#{interface_index}",
      out_octets: "#{OID_IF_TABLE}.16.#{interface_index}",
      in_errors: "#{OID_IF_TABLE}.14.#{interface_index}",
      out_errors: "#{OID_IF_TABLE}.20.#{interface_index}",
      status: "#{OID_IF_TABLE}.8.#{interface_index}"
    }
    open_manager do |manager|
      response = manager.get(oids.values)
      oids.keys.zip(response.varbind_list.map(&:value)).to_h
    end
  end

  def walk_interface_table
    interfaces = []
    open_manager do |manager|
      manager.walk(OID_IF_TABLE) do |row|
        interfaces << { oid: row.name.to_s, value: row.value }
      end
    end
    interfaces
  end

  private

  # Uses the configured version for every request, not only get_system_info
  def open_manager(&block)
    SNMP::Manager.open(host: @host, community: @community, version: @version, &block)
  end
end
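The octet and error values returned by SNMP polling are cumulative counters, so a usable throughput figure comes from the difference between two polls. The sketch below is one way to compute that rate; counter_rate and its hash-based arguments are illustrative names, and it assumes a 32-bit counter (ifInOctets and ifOutOctets are Counter32 in IF-MIB) that wrapped at most once between polls.

```ruby
COUNTER32_MODULUS = 2**32

# Converts two cumulative counter readings into a per-second rate.
# previous and current are hashes with :value (counter) and :time (seconds).
def counter_rate(previous, current, modulus: COUNTER32_MODULUS)
  elapsed = current[:time] - previous[:time]
  return nil unless elapsed.positive?

  delta = current[:value] - previous[:value]
  # A negative delta is treated as a single counter wrap. A device reboot
  # also resets counters, which this sketch cannot distinguish from a wrap.
  delta += modulus if delta.negative?
  delta / elapsed.to_f
end

counter_rate({ value: 1_000, time: 0 }, { value: 7_000, time: 60 })
# => 100.0 (bytes per second)
```

On fast links a 32-bit octet counter can wrap more than once between polls, so shorter polling intervals or 64-bit counters are usually preferred there.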
Implementation Approaches
Network monitoring implementations vary based on monitoring scope, infrastructure scale, and performance requirements.
Centralized Polling: A central monitoring server periodically polls monitored targets to collect metrics. The monitoring server maintains a list of targets and check intervals, executing checks according to schedule. This approach provides simple deployment and configuration management, with all monitoring logic concentrated in a central location.
Centralized polling scales to hundreds of monitored endpoints before network bandwidth or monitoring server capacity becomes constrained. For larger deployments, multiple monitoring servers operate in parallel, each responsible for a subset of targets. Load balancing distributes checks across servers.
Advantages include simplified configuration management, consistent check execution, and centralized result storage. Disadvantages include network overhead from polling traffic, potential monitoring gaps during network partitions, and scaling limitations for very large deployments.
Distributed Agents: Monitoring agents run on each monitored system, collecting local metrics and forwarding results to a central aggregation service. Agents access local system resources directly, gathering detailed metrics unavailable through remote polling. Agent-based monitoring scales to thousands of endpoints with minimal network overhead.
The agent push model inverts the polling relationship. Rather than the monitoring server pulling metrics, agents push metrics at regular intervals. Because agents initiate outbound connections, this approach traverses firewalls and NAT more reliably than polling. Agents cache metrics during network outages and transmit the buffered data when connectivity is restored.
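The buffering behavior can be sketched as below. BufferingAgent and its transport argument are hypothetical: the transport is assumed to be any callable that delivers a batch and raises on failure, and the buffer is bounded so a long outage cannot exhaust memory.

```ruby
# Hypothetical push agent: buffers metrics locally and flushes them in order.
class BufferingAgent
  attr_reader :buffer

  def initialize(transport, max_buffer: 1000)
    @transport = transport # any object responding to #call(batch)
    @max_buffer = max_buffer
    @buffer = []
  end

  def record(name, value, time: Time.now.to_i)
    @buffer << { name: name, value: value, time: time }
    @buffer.shift while @buffer.size > @max_buffer # drop oldest when full
  end

  # Returns true when the batch was delivered, false when it stays buffered.
  def flush
    return true if @buffer.empty?

    @transport.call(@buffer.dup)
    @buffer.clear
    true
  rescue StandardError
    false
  end
end
```

An agent would typically call flush on a timer. A failed flush leaves the buffer untouched, so the next successful flush delivers the backlog in its original order.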
Agents require deployment and maintenance across all monitored systems. Configuration management systems like Chef, Puppet, or Ansible automate agent deployment. Containerized environments use sidecar containers for monitoring agents.
Hybrid Architecture: Large-scale monitoring combines centralized polling for external endpoints with distributed agents for internal systems. External website monitoring uses centralized probes from multiple geographic locations. Internal server monitoring uses local agents for detailed metrics. This approach optimizes network efficiency while maintaining comprehensive visibility.
Real-Time Stream Processing: High-frequency monitoring generates data volumes poorly suited to batch processing. Streaming platforms like Apache Kafka transport monitoring metrics in real time, and stream processors such as Kafka Streams or Apache Flink apply filtering, aggregation, and alerting rules before results are persisted. With sufficient partitioning, this approach can handle millions of metrics per second while keeping alerting latency under a second.
Monitoring data flows through ingestion, processing, storage, and visualization stages. Ingestion normalizes data from heterogeneous sources into common formats. Processing applies aggregation, anomaly detection, and threshold evaluation. Storage optimizes metric retention and query performance. Visualization presents metrics through dashboards and graphs.
Time-Series Database Storage: Time-series databases optimize metric storage and retrieval. Specialized storage engines achieve high compression ratios through delta encoding and downsampling. Query engines support temporal aggregations and time-based filtering efficiently. Retention policies automatically downsample and expire old data.
Popular time-series databases include Prometheus, InfluxDB, and TimescaleDB. Selection depends on query patterns, data volume, and integration requirements. Prometheus excels at short-term storage and offers a powerful query language (PromQL). InfluxDB provides a flexible schema and clustering. TimescaleDB extends PostgreSQL with time-series optimizations.
Tools & Ecosystem
The Ruby ecosystem includes several monitoring tools and gems for implementing network monitoring solutions.
Prometheus Client: The prometheus-client gem exports application metrics in Prometheus format, enabling integration with Prometheus monitoring infrastructure.
require 'prometheus/client'
require 'prometheus/client/push'

# Initialize registry and metrics. Constants are used because top-level
# local variables are not visible inside method definitions.
REGISTRY = Prometheus::Client.registry

HTTP_REQUESTS = REGISTRY.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

REQUEST_DURATION = REGISTRY.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
)

# Instrument application code
def handle_request(method, path)
  start = Time.now
  # Application logic (perform_request is a placeholder for the real handler)
  status = perform_request(method, path)
  duration = Time.now - start
  HTTP_REQUESTS.increment(labels: { method: method, path: path, status: status })
  REQUEST_DURATION.observe(duration, labels: { method: method, path: path })
  status
end

# Push metrics to Prometheus Pushgateway
Prometheus::Client::Push.new(
  job: 'batch_job',
  gateway: 'http://pushgateway:9091'
).add(REGISTRY)
StatsD Client: The statsd-ruby gem sends metrics to StatsD servers for aggregation and forwarding to monitoring backends.
require 'statsd'

statsd = Statsd.new('localhost', 8125)

# Increment counters
statsd.increment('api.requests')
# statsd-ruby has no tags: option, so dimensions go into the metric name.
# Tag support requires a client such as dogstatsd-ruby.
statsd.increment('api.errors.users.500')

# Record timing (milliseconds)
statsd.timing('api.response_time', 250)

# Time block execution
statsd.time('database.query') do
  # Database operation
end

# Gauge for current values (connection_pool is a placeholder)
statsd.gauge('active_connections', connection_pool.size)

# Set for unique values (user_id is a placeholder)
statsd.set('unique_users', user_id)
HTTP Monitoring: The faraday gem provides middleware for HTTP request instrumentation and monitoring.
require 'faraday'
require 'datadog/statsd' # dogstatsd-ruby; its client accepts the tags: option used below

class MonitoringMiddleware < Faraday::Middleware
  def initialize(app, statsd:)
    super(app)
    @statsd = statsd
  end

  def call(env)
    start = Time.now
    @app.call(env).on_complete do |response|
      duration = ((Time.now - start) * 1000).round(2)
      @statsd.timing('http.request.duration', duration)
      @statsd.increment('http.request.count',
                        tags: ["status:#{response.status}", "host:#{env.url.host}"])
      if response.status >= 500
        @statsd.increment('http.request.errors',
                          tags: ["status:#{response.status}"])
      end
    end
  end
end

# Configure Faraday with monitoring
statsd = Datadog::Statsd.new('localhost', 8125)
conn = Faraday.new(url: 'https://api.example.com') do |f|
  f.use MonitoringMiddleware, statsd: statsd
  f.adapter Faraday.default_adapter
end
System Metrics: The sys-proctable and vmstat gems collect system-level metrics for host monitoring.
require 'sys/proctable'
require 'vmstat'

class SystemMonitor
  BYTES_PER_MB = 1024 * 1024

  def collect_process_metrics
    processes = Sys::ProcTable.ps
    # Field availability and units (for example rss) vary by platform
    {
      total_processes: processes.count,
      total_threads: processes.sum { |p| p.nlwp.to_i },
      cpu_percentage: processes.sum { |p| p.pctcpu.to_f },
      memory_rss_mb: processes.sum { |p| p.rss.to_i } / 1024
    }
  end

  def collect_system_metrics
    memory = Vmstat.memory
    # Total is the sum of all page categories, not the wired pages alone
    total_pages = memory.wired + memory.active + memory.inactive + memory.free
    {
      memory: {
        total_mb: memory.pagesize * total_pages / BYTES_PER_MB,
        active_mb: memory.pagesize * memory.active / BYTES_PER_MB,
        free_mb: memory.pagesize * memory.free / BYTES_PER_MB
      },
      # Vmstat.cpu returns one entry per core with cumulative tick counters
      cpu: Vmstat.cpu.map do |core|
        { core: core.num, user: core.user, system: core.system, idle: core.idle }
      end,
      load_average: Vmstat.load_average.one_minute
    }
  end
end
Application Performance Monitoring: Full-featured APM solutions like New Relic, Datadog, and Scout provide comprehensive monitoring with minimal instrumentation.
The monitoring gem ecosystem continues expanding with specialized tools for specific protocols and platforms. Selecting appropriate tools depends on monitoring requirements, existing infrastructure, and team expertise.
Performance Considerations
Network monitoring introduces overhead that affects both monitored systems and monitoring infrastructure. Understanding performance characteristics enables effective monitoring without degrading system performance.
Monitoring Overhead: Each monitoring check consumes CPU, memory, network bandwidth, and storage. Active HTTP checks generate network requests that increase server load. Passive packet capture processes every network packet, consuming CPU cycles. Balancing monitoring granularity against resource consumption requires careful tuning.
High-frequency polling generates significant network traffic. Monitoring 1,000 endpoints every second produces 1,000 requests per second, potentially saturating network links or overwhelming monitored systems. Adjusting check intervals based on metric volatility reduces unnecessary overhead. Critical metrics warrant frequent checks; stable metrics require infrequent validation.
Collection Efficiency: Efficient metric collection minimizes monitoring overhead. Batching multiple metrics into single requests reduces network round trips. Connection pooling eliminates repeated connection establishment overhead. Asynchronous collection prevents blocking on slow endpoints.
require 'concurrent'
require 'net/http'
require 'uri'

class EfficientMonitor
  def initialize(concurrency: 10)
    @pool = Concurrent::FixedThreadPool.new(concurrency)
  end

  def check_endpoints_parallel(endpoints)
    futures = endpoints.map do |endpoint|
      Concurrent::Future.execute(executor: @pool) do
        check_endpoint(endpoint)
      end
    end
    futures.map(&:value)
  end

  def check_endpoint(url)
    uri = URI(url)
    start = Time.now
    # Timeouts are passed to start because the connection is already open
    # by the time the block runs, so setting open_timeout there has no effect
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == 'https',
                    open_timeout: 5,
                    read_timeout: 5) do |http|
      # request_uri keeps the query string and is never empty
      response = http.get(uri.request_uri)
      duration = ((Time.now - start) * 1000).round(2)
      { url: url, status: response.code.to_i, duration: duration }
    end
  rescue StandardError => e
    { url: url, error: e.message }
  end
end

monitor = EfficientMonitor.new(concurrency: 20)
results = monitor.check_endpoints_parallel([
  'https://api1.example.com/health',
  'https://api2.example.com/health',
  'https://api3.example.com/health'
])
Metric Aggregation: Raw metrics generate enormous data volumes. Aggregating metrics before storage reduces storage requirements and query costs. Computing statistics like average, minimum, maximum, and percentiles during collection eliminates the need for post-processing.
Time-series databases apply downsampling to reduce resolution of old data. Recent data maintains high resolution for detailed analysis. Older data aggregates into hourly or daily summaries. Retention policies automatically delete data exceeding configured age.
Query Performance: Monitoring dashboards execute numerous queries simultaneously. Inefficient queries degrade dashboard performance and increase database load. Proper indexing, query optimization, and caching improve query response times.
Pre-computing frequently queried aggregations reduces query complexity. Materialized views maintain pre-aggregated results updated incrementally. This trades increased storage for faster query response.
Alerting Latency: Alert delivery latency affects incident response time. Monitoring systems evaluate alert conditions continuously, typically every 30-60 seconds. Complex alert conditions increase evaluation time. Keeping alert logic simple maintains low latency.
Batching alerts reduces notification volume but increases delivery latency. Immediate notification ensures rapid response at the cost of potential alert fatigue from high-frequency notifications.
Security Implications
Network monitoring accesses sensitive information and requires careful security consideration to prevent data exposure and system compromise.
Credential Management: Monitoring systems authenticate to monitored services using credentials. Storing credentials securely prevents unauthorized access. Secret management systems like HashiCorp Vault or AWS Secrets Manager encrypt credentials at rest and provide audit logging.
Avoid hardcoding credentials in monitoring scripts. Environment variables provide basic credential injection, but they are readable by anyone who can inspect the process environment and are inherited by child processes. Dedicated secret management integrates with monitoring tools for secure credential retrieval.
require 'aws-sdk-secretsmanager'
require 'json'
require 'net/http'
require 'uri'

class SecureMonitor
  def initialize(region: 'us-east-1')
    @secrets_client = Aws::SecretsManager::Client.new(region: region)
  end

  # Expects the secret to be a JSON document with username and password keys
  def get_monitoring_credentials(secret_name)
    response = @secrets_client.get_secret_value(secret_id: secret_name)
    JSON.parse(response.secret_string)
  rescue Aws::SecretsManager::Errors::ResourceNotFoundException
    raise "Secret #{secret_name} not found"
  end

  def monitor_with_secure_credentials(url, secret_name)
    credentials = get_monitoring_credentials(secret_name)
    uri = URI(url)
    request = Net::HTTP::Get.new(uri)
    request.basic_auth(credentials['username'], credentials['password'])
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      response = http.request(request)
      { status: response.code.to_i, success: response.is_a?(Net::HTTPSuccess) }
    end
  end
end
Sensitive Data Exposure: Network monitoring captures potentially sensitive information including authentication tokens, API keys, and personal data in URLs or request bodies. Proper data sanitization prevents sensitive information from appearing in logs or metrics.
Implementing data masking redacts sensitive patterns before logging. Regular expressions identify common sensitive patterns like credit card numbers, Social Security numbers, and API keys. Masking replaces identified values with placeholder text.
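A minimal sketch of pattern-based masking follows. The MASK_PATTERNS table and mask_sensitive method are illustrative, and the expressions are deliberately simple: they will miss some formats and can produce false positives on long numeric identifiers, so real deployments need patterns tuned to their own data.

```ruby
# Illustrative patterns only; each maps a regular expression to its replacement.
MASK_PATTERNS = {
  # 13 to 16 digits, optionally separated by spaces or hyphens
  /\b\d(?:[ -]?\d){12,15}\b/ => '[REDACTED_CARD]',
  # US Social Security number in NNN-NN-NNNN form
  /\b\d{3}-\d{2}-\d{4}\b/ => '[REDACTED_SSN]',
  # key=value pairs in URLs or form bodies; the key name is preserved
  /((?:api[_-]?key|token|password)=)[^&\s]+/i => '\1[REDACTED]'
}.freeze

# Applies every pattern in turn and returns the masked copy of the text.
def mask_sensitive(text)
  MASK_PATTERNS.reduce(text) do |masked, (pattern, replacement)|
    masked.gsub(pattern, replacement)
  end
end

mask_sensitive('GET /v1/users?api_key=abc123&page=2')
# => "GET /v1/users?api_key=[REDACTED]&page=2"
```

Masking is best applied at the point where log lines and metric labels are produced, so that unmasked values never reach storage.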
TLS Certificate Validation: Monitoring HTTPS endpoints requires proper TLS certificate validation. Disabling certificate verification creates man-in-the-middle vulnerabilities. Monitoring systems must validate certificates against trusted certificate authorities.
Custom certificate authorities require explicit trust configuration. Monitoring agents need access to CA certificate bundles for validation. Certificate expiration monitoring prevents outages from expired certificates.
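Certificate expiration monitoring can be sketched with Ruby's OpenSSL bindings. The days_until_expiry and expiry_status helpers are hypothetical names, and the 30-day and 7-day thresholds are placeholders. The sketch takes an OpenSSL::X509::Certificate so that it stays independent of how the certificate was obtained.

```ruby
require 'openssl'

SECONDS_PER_DAY = 86_400

# Whole days until the certificate's notAfter date; negative once expired.
def days_until_expiry(certificate, now: Time.now)
  ((certificate.not_after - now) / SECONDS_PER_DAY).floor
end

# Classifies a certificate by remaining lifetime. Thresholds are placeholders.
def expiry_status(certificate, warning_days: 30, critical_days: 7, now: Time.now)
  days = days_until_expiry(certificate, now: now)
  return :expired if days.negative?
  return :critical if days <= critical_days
  return :warning if days <= warning_days

  :ok
end
```

For a live endpoint, the certificate presented by the server is available from OpenSSL::SSL::SSLSocket#peer_cert after the TLS handshake completes.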
Access Control: Monitoring dashboards display sensitive operational information. Proper authentication and authorization prevent unauthorized access. Role-based access control limits data visibility based on user permissions.
Monitoring agents running on production systems require minimal privileges. Service accounts with limited permissions reduce blast radius from compromised agents. Network segmentation restricts monitoring traffic to dedicated monitoring networks.
Data Retention: Monitoring data retention policies balance compliance requirements against storage costs. Regulations like GDPR require data minimization and retention limits. Configuring appropriate retention periods prevents indefinite storage of sensitive operational data.
Time-series databases support automatic data expiration based on age. Setting retention policies ensures old data is deleted automatically. Critical security events may warrant longer retention than routine metrics.
Real-World Applications
Network monitoring operates across diverse production environments, each with specific monitoring requirements and challenges.
Microservices Architecture: Distributed microservices increase monitoring complexity. Each service requires health checks, performance monitoring, and dependency tracking. Service mesh technologies like Istio automatically instrument service-to-service communication, providing detailed latency and error rate metrics.
Distributed tracing tracks requests across service boundaries, identifying performance bottlenecks in complex call chains. Trace context propagation attaches trace IDs to requests, enabling correlation of logs and metrics across services.
require 'faraday'
require 'securerandom'
require 'datadog/statsd' # dogstatsd-ruby; its client accepts the tags: option used below

class DistributedMonitor
  TRACE_HEADER = 'X-Trace-Id'

  def initialize(service_name)
    @service_name = service_name
    @statsd = Datadog::Statsd.new('localhost', 8125)
  end

  def call_downstream_service(url, trace_id: nil)
    trace_id ||= SecureRandom.uuid
    start = Time.now
    conn = Faraday.new(url) do |f|
      f.adapter Faraday.default_adapter
    end
    response = conn.get do |req|
      req.headers[TRACE_HEADER] = trace_id
    end
    duration = ((Time.now - start) * 1000).round(2)
    # Note: a per-request trace ID is a high-cardinality tag, which some
    # metric backends bill for or limit
    @statsd.timing("#{@service_name}.downstream.duration", duration,
                   tags: ["trace_id:#{trace_id}", "status:#{response.status}"])
    { trace_id: trace_id, response: response }
  rescue StandardError => e
    @statsd.increment("#{@service_name}.downstream.errors",
                      tags: ["trace_id:#{trace_id}", "error:#{e.class}"])
    raise
  end
end
Database Monitoring: Database performance directly impacts application performance. Monitoring query latency, connection pool utilization, and slow query logs identifies performance issues. Connection pool monitoring prevents connection exhaustion that causes application timeouts.
Replication lag monitoring ensures data consistency in replicated databases. High replication lag indicates capacity problems or network issues between primary and replica nodes.
CDN and Edge Network Monitoring: Content delivery networks distribute content globally. Monitoring from multiple geographic locations validates CDN performance and availability. Edge network monitoring tracks cache hit rates, origin fetch times, and geographic distribution of requests.
Container Orchestration Monitoring: Kubernetes and container orchestration platforms introduce dynamic infrastructure. Containers start and stop frequently, requiring service discovery integration for monitoring. Container health checks determine container readiness and liveness.
Resource monitoring tracks CPU, memory, and network usage per container. Container orchestration platforms expose metrics through APIs that monitoring agents consume.
Third-Party API Monitoring: Applications depend on external APIs for functionality. Monitoring third-party API availability, latency, and error rates enables proactive issue detection. Synthetic monitoring from multiple locations validates global API availability.
Rate limit monitoring tracks API usage against quotas, preventing unexpected failures from exceeded limits. Error rate monitoring identifies degraded API performance requiring fallback strategies.
Reference
Monitoring Check Types
| Check Type | Protocol | Use Case | Typical Interval |
|---|---|---|---|
| ICMP Ping | ICMP | Host reachability | 30-60 seconds |
| TCP Port | TCP | Service availability | 30-60 seconds |
| HTTP/HTTPS | HTTP/HTTPS | Web service health | 60-300 seconds |
| DNS | DNS | Name resolution | 300-600 seconds |
| SMTP | SMTP | Mail server availability | 300-600 seconds |
| Database | Native protocol | Database connectivity | 60-300 seconds |
| Certificate | TLS | Certificate expiration | 3600-86400 seconds |
| API | HTTP/HTTPS | API endpoint functionality | 60-300 seconds |
Common SNMP OIDs
| Metric | OID | Description |
|---|---|---|
| System Uptime | 1.3.6.1.2.1.1.3.0 | Time since system boot |
| System Description | 1.3.6.1.2.1.1.1.0 | System description |
| Interface Count | 1.3.6.1.2.1.2.1.0 | Number of network interfaces |
| Interface Status | 1.3.6.1.2.1.2.2.1.8 | Operational status per interface |
| Interface In Octets | 1.3.6.1.2.1.2.2.1.10 | Bytes received per interface |
| Interface Out Octets | 1.3.6.1.2.1.2.2.1.16 | Bytes transmitted per interface |
| Interface Errors In | 1.3.6.1.2.1.2.2.1.14 | Inbound errors per interface |
| Interface Errors Out | 1.3.6.1.2.1.2.2.1.20 | Outbound errors per interface |
| CPU Load 1 Minute | 1.3.6.1.4.1.2021.10.1.3.1 | CPU load average 1 minute |
| Memory Total | 1.3.6.1.4.1.2021.4.5.0 | Total system memory |
| Memory Available | 1.3.6.1.4.1.2021.4.6.0 | Available system memory |
HTTP Status Code Categories
| Range | Category | Monitoring Action |
|---|---|---|
| 1xx | Informational | Log, no alert |
| 2xx | Success | Normal operation |
| 3xx | Redirection | Verify redirect target |
| 4xx | Client Error | Alert on rate increase |
| 5xx | Server Error | Alert immediately |
Metric Types
| Type | Description | Example | Aggregation |
|---|---|---|---|
| Counter | Monotonically increasing value | Total requests | Rate, delta |
| Gauge | Point-in-time value | Current connections | Average, min, max |
| Histogram | Distribution of values | Request duration | Percentiles, average |
| Summary | Pre-computed percentiles | Response time p95 | Percentile values |
| Set | Unique value count | Unique users | Cardinality |
Alert Thresholds
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| HTTP Availability | Below 99.5% | Below 99.0% |
| Response Time | Above 500ms p95 | Above 1000ms p95 |
| Error Rate | Above 1% | Above 5% |
| Packet Loss | Above 1% | Above 5% |
| Certificate Expiration | 30 days remaining | 7 days remaining |
| Disk Space | Above 80% used | Above 90% used |
| Memory Usage | Above 80% | Above 90% |
| CPU Usage | Above 70% | Above 85% |
Network Latency Targets
| Connection Type | Expected Latency | Quality Classification |
|---|---|---|
| Same Region | 1-5ms | Excellent |
| Continental | 20-50ms | Good |
| Intercontinental | 100-200ms | Acceptable |
| Satellite | 500-700ms | High latency |
| Mobile 4G | 30-50ms | Good |
| Mobile 3G | 100-500ms | Acceptable |
Ruby Monitoring Gems
| Gem | Purpose | Use Case |
|---|---|---|
| net-ping | ICMP, TCP, UDP ping | Host reachability checks |
| snmp | SNMP client | Network device monitoring |
| prometheus-client | Metrics export | Prometheus integration |
| statsd-ruby | StatsD client | Metrics aggregation |
| dogstatsd-ruby | Datadog client | Datadog monitoring, tagged metrics |
| newrelic_rpm | New Relic agent | APM monitoring |
| honeybadger | Error tracking | Exception monitoring |
| sentry-ruby | Error tracking | Error aggregation |
| sys-proctable | Process table | System metrics |
| vmstat | System statistics | Resource monitoring |