CrackedRuby - Network Monitoring

Overview

Network monitoring involves the continuous observation and measurement of network infrastructure to ensure availability, performance, and security. The practice encompasses tracking bandwidth usage, latency, packet loss, device status, and security events across routers, switches, servers, and applications.

Modern applications depend on reliable network connectivity. A web application may interact with databases, caching layers, external APIs, and content delivery networks. Network monitoring provides visibility into these connections, identifying bottlenecks, failures, and security threats before they impact users.

Network monitoring operates at multiple layers of the network stack. At the physical layer, monitoring tracks device availability and hardware health. At the transport layer, it measures TCP connection states, retransmissions, and latency. At the application layer, it analyzes HTTP response times, API endpoint performance, and protocol-specific metrics.

The fundamental monitoring workflow involves data collection through agents or probes, metric aggregation in a central system, threshold-based alerting, and visualization through dashboards. Active monitoring sends synthetic requests to measure availability and response times. Passive monitoring captures actual traffic patterns without generating additional load.

require 'socket'

# Simple TCP connect check on port 80 (not ICMP; raw ICMP sockets usually need elevated privileges)
def check_host_reachability(host, timeout = 5)
  start_time = Time.now
  
  begin
    TCPSocket.new(host, 80, connect_timeout: timeout).close
    latency = ((Time.now - start_time) * 1000).round(2)
    { status: :reachable, latency_ms: latency }
  rescue Errno::ETIMEDOUT, Errno::ECONNREFUSED, Errno::EHOSTUNREACH, SocketError
    { status: :unreachable, latency_ms: nil }
  end
end

check_host_reachability('example.com')
# => { status: :reachable, latency_ms: 45.32 }  (example output)

Network monitoring distinguishes between synthetic monitoring and real user monitoring. Synthetic monitoring executes scripted transactions from controlled locations, providing consistent baselines. Real user monitoring captures actual user experience by instrumenting application traffic.

Key Principles

Network monitoring operates on several foundational principles that guide effective implementation and interpretation of results.

Continuous Collection: Network conditions change constantly. Sporadic checks miss transient issues that impact user experience. Monitoring systems collect data at regular intervals, typically ranging from one second for critical metrics to several minutes for less volatile measurements. The collection interval balances data granularity against storage and processing overhead.

Multi-Layer Visibility: Networks operate across multiple protocol layers. Monitoring a single layer provides incomplete information. A web request involves DNS resolution, TCP connection establishment, TLS negotiation, HTTP exchange, and application processing. Effective monitoring instruments each layer to pinpoint failure locations.

Baseline Establishment: Raw metrics require context. A 200ms API response time means nothing without knowing typical performance. Monitoring systems establish baselines through statistical analysis of historical data, then detect deviations that indicate problems. Baselines account for daily and weekly patterns in traffic.
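
As a rough illustration of baseline-based detection, the sketch below computes a mean and standard deviation from historical samples and flags values that fall far outside them. The Baseline class, its method names, and the 3-sigma threshold are assumptions made for this example. It ignores the daily and weekly patterns mentioned above, which a fuller version might handle by keeping one baseline per hour of the week.

```ruby
# Hypothetical helper: flags a sample that deviates from a historical baseline.
class Baseline
  attr_reader :mean, :stddev

  def initialize(history)
    raise ArgumentError, 'history must not be empty' if history.empty?

    @mean = history.sum.to_f / history.size
    variance = history.sum { |v| (v - @mean)**2 } / history.size
    @stddev = Math.sqrt(variance)
  end

  # Number of standard deviations between the sample and the mean.
  # Returns 0.0 when the history has no variance at all.
  def z_score(sample)
    return 0.0 if @stddev.zero?

    (sample - @mean) / @stddev
  end

  # The 3.0 default is an illustrative choice, not a recommendation.
  def anomalous?(sample, threshold: 3.0)
    z_score(sample).abs > threshold
  end
end

history = [180, 190, 200, 210, 220] # response times in ms
baseline = Baseline.new(history)
baseline.anomalous?(205) # => false
baseline.anomalous?(400) # => true
```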

Distributed Perspective: Monitoring from a single location reveals only that location's network path. Production systems serve users from multiple geographic regions through varied network paths. Distributed monitoring deploys probes across locations to capture regional differences in performance and availability.

Active and Passive Monitoring: Active monitoring sends synthetic requests to measure availability and response times under controlled conditions. Passive monitoring analyzes actual traffic without generating load. Active monitoring detects failures immediately but may miss issues affecting only specific user segments. Passive monitoring captures real user experience but requires traffic to detect problems.

Threshold-Based Alerting: Continuous monitoring generates massive data volumes. Engineers cannot manually inspect every metric. Alerting systems evaluate metrics against thresholds, triggering notifications when values exceed acceptable bounds. Effective thresholds balance sensitivity against false positive rates.
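
A minimal sketch of threshold evaluation, assuming metrics where a higher value is worse. The THRESHOLDS table, the metric names, and the evaluate_threshold helper are hypothetical and exist only for this example.

```ruby
# Hypothetical threshold table; metric names and limits are illustrative.
THRESHOLDS = {
  error_rate_pct: { warning: 1.0, critical: 5.0 },
  packet_loss_pct: { warning: 1.0, critical: 5.0 },
  p95_latency_ms: { warning: 500, critical: 1000 }
}.freeze

# Returns :ok, :warning, or :critical for a metric where higher is worse.
def evaluate_threshold(metric, value, thresholds: THRESHOLDS)
  limits = thresholds.fetch(metric)

  if value >= limits[:critical]
    :critical
  elsif value >= limits[:warning]
    :warning
  else
    :ok
  end
end

evaluate_threshold(:error_rate_pct, 0.5)  # => :ok
evaluate_threshold(:error_rate_pct, 2.0)  # => :warning
evaluate_threshold(:p95_latency_ms, 1200) # => :critical
```

Lowering the warning limit increases sensitivity but also increases the number of false positives, which is the trade-off described above.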

Protocol-Specific Metrics: Different protocols require different monitoring approaches. HTTP monitoring tracks status codes, response times, and content validity. DNS monitoring measures query resolution time and record accuracy. SMTP monitoring verifies mail server connectivity and delivery times. Protocol-specific monitoring provides actionable insights for troubleshooting.

The data collection mechanism varies by monitoring type. SNMP polling queries devices for management information. Flow-based monitoring analyzes NetFlow or sFlow data exported by network devices. Packet capture examines individual packets for deep protocol analysis. Log aggregation collects and parses application and system logs.

require 'snmp'

# SNMP polling for interface metrics
def collect_interface_stats(host, community = 'public')
  SNMP::Manager.open(host: host, community: community) do |manager|
    response = manager.get([
      '1.3.6.1.2.1.2.2.1.10.1',  # ifInOctets
      '1.3.6.1.2.1.2.2.1.16.1',  # ifOutOctets
      '1.3.6.1.2.1.2.2.1.8.1'    # ifOperStatus
    ])
    
    {
      bytes_in: response.varbind_list[0].value.to_i,
      bytes_out: response.varbind_list[1].value.to_i,
      status: response.varbind_list[2].value.to_i == 1 ? :up : :down
    }
  end
end

Metric aggregation reduces data volume while preserving meaningful information. Raw per-second metrics may aggregate into one-minute averages, with minimum, maximum, and percentile values retained for variance analysis. Time-series databases optimize storage and retrieval of timestamped metric data.
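
The aggregation step can be sketched as follows, assuming samples arrive as [unix_timestamp, value] pairs. The function names are made up for this example, and the percentile uses the nearest-rank method, which is only one of several common definitions.

```ruby
# Nearest-rank percentile over an already sorted array.
def percentile(sorted, pct)
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank - 1, 0].max]
end

# samples: array of [unix_timestamp, value] pairs.
# Returns one summary hash per 60-second bucket, keyed by bucket start time.
def aggregate_by_minute(samples)
  samples.group_by { |ts, _| ts - (ts % 60) }.transform_values do |bucket|
    values = bucket.map { |_, v| v }.sort
    {
      count: values.size,
      avg: (values.sum.to_f / values.size).round(2),
      min: values.first,
      max: values.last,
      p95: percentile(values, 95)
    }
  end
end

aggregate_by_minute([[0, 10], [1, 20], [59, 30], [60, 40]])
# => { 0  => { count: 3, avg: 20.0, min: 10, max: 30, p95: 30 },
#      60 => { count: 1, avg: 40.0, min: 40, max: 40, p95: 40 } }
```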

Ruby Implementation

Ruby provides multiple approaches for implementing network monitoring, from low-level socket operations to high-level HTTP clients and specialized monitoring gems.

Socket-Level Monitoring: The Socket library enables direct TCP and UDP communication for implementing custom protocols or performing low-level connectivity checks.

require 'socket'
require 'timeout'

class NetworkMonitor
  def check_tcp_port(host, port, timeout = 5)
    start = Time.now
    
    begin
      Timeout.timeout(timeout) do
        socket = TCPSocket.new(host, port)
        socket.close
        { 
          available: true, 
          response_time: ((Time.now - start) * 1000).round(2) 
        }
      end
    rescue Timeout::Error
      { available: false, error: 'Connection timeout' }
    rescue Errno::ECONNREFUSED
      { available: false, error: 'Connection refused' }
    rescue Errno::EHOSTUNREACH
      { available: false, error: 'Host unreachable' }
    rescue SocketError => e
      { available: false, error: "DNS or network error: #{e.message}" }
    end
  end
  
  def check_udp_service(host, port, payload, expected_response)
    socket = UDPSocket.new
    socket.send(payload, 0, host, port)
    
    begin
      Timeout.timeout(5) do
        response, = socket.recvfrom(1024)
        { success: response.include?(expected_response) }
      end
    rescue Timeout::Error
      { success: false, error: 'No response' }
    ensure
      socket.close
    end
  end
end

monitor = NetworkMonitor.new
monitor.check_tcp_port('google.com', 443)
# => { available: true, response_time: 23.45 }

HTTP Endpoint Monitoring: The Net::HTTP library and modern HTTP clients like HTTP.rb provide comprehensive request/response monitoring capabilities.

require 'net/http'
require 'uri'
require 'json'

class HTTPMonitor
  def check_endpoint(url, expected_status: 200, timeout: 10)
    uri = URI(url)
    start = Time.now
    
    Net::HTTP.start(uri.host, uri.port, 
                   use_ssl: uri.scheme == 'https',
                   open_timeout: timeout,
                   read_timeout: timeout) do |http|
      
      request = Net::HTTP::Get.new(uri)
      response = http.request(request)
      
      duration = ((Time.now - start) * 1000).round(2)
      
      {
        status_code: response.code.to_i,
        response_time: duration,
        success: response.code.to_i == expected_status,
        body: response.body.to_s,  # kept so check_api_endpoint can parse it
        body_size: response.body.to_s.bytesize,
        headers: response.to_hash
      }
    end
  rescue Net::OpenTimeout
    { success: false, error: 'Connection timeout' }
  rescue Net::ReadTimeout
    { success: false, error: 'Read timeout' }
  rescue StandardError => e
    { success: false, error: e.message }
  end
  
  def check_api_endpoint(url, expected_keys: [])
    result = check_endpoint(url)
    
    if result[:success]
      begin
        json = JSON.parse(result.delete(:body))
        missing_keys = expected_keys - json.keys
        result[:valid_response] = missing_keys.empty?
        result[:missing_keys] = missing_keys unless missing_keys.empty?
      rescue JSON::ParserError
        result[:valid_response] = false
        result[:error] = 'Invalid JSON response'
      end
    end
    
    result
  end
end

monitor = HTTPMonitor.new
monitor.check_api_endpoint(
  'https://api.example.com/health',
  expected_keys: ['status', 'version']
)

DNS Monitoring: The Resolv library enables DNS query monitoring to track resolution times and validate DNS configurations.

require 'resolv'

class DNSMonitor
  def check_resolution(hostname, expected_ip: nil)
    start = Time.now
    resolver = Resolv::DNS.new(nameserver: ['8.8.8.8'])
    
    begin
      addresses = resolver.getaddresses(hostname)
      resolution_time = ((Time.now - start) * 1000).round(2)
      
      result = {
        resolved: !addresses.empty?,  # getaddresses returns [] when nothing resolves
        resolution_time: resolution_time,
        addresses: addresses.map(&:to_s)
      }
      
      if expected_ip
        result[:matches_expected] = addresses.map(&:to_s).include?(expected_ip)
      end
      
      result
    rescue Resolv::ResolvError
      { resolved: false, error: 'DNS resolution failed' }
    ensure
      resolver.close
    end
  end
  
  def check_dns_records(hostname, record_type: :A)
    resolver = Resolv::DNS.new
    
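    type_class = {
      A: Resolv::DNS::Resource::IN::A,
      MX: Resolv::DNS::Resource::IN::MX,
      TXT: Resolv::DNS::Resource::IN::TXT
    }.fetch(record_type) { raise ArgumentError, "Unsupported record type: #{record_type}" }
    
    resources = resolver.getresources(hostname, type_class)
    
    # Each record class exposes different accessors, so format them per type
    records = resources.map do |r|
      case r
      when Resolv::DNS::Resource::IN::A then r.address.to_s
      when Resolv::DNS::Resource::IN::MX then "#{r.preference} #{r.exchange}"
      when Resolv::DNS::Resource::IN::TXT then r.strings.join
      end
    end
    
    { record_type: record_type, count: resources.size, records: records }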
  ensure
    resolver.close
  end
end

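Ping Monitoring: The net-ping gem provides ICMP, TCP, and UDP ping implementations for host reachability checks. ICMP pings open raw sockets, so on most Unix-like systems the process must run with root privileges.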

require 'net/ping'

class PingMonitor
  def icmp_ping(host, count: 4)
    pinger = Net::Ping::ICMP.new(host)
    results = []
    
    count.times do
      result = pinger.ping
      results << {
        success: result,
        duration: pinger.duration ? (pinger.duration * 1000).round(2) : nil
      }
      sleep 1
    end
    
    successful = results.count { |r| r[:success] }
    durations = results.select { |r| r[:duration] }.map { |r| r[:duration] }
    
    {
      packets_sent: count,
      packets_received: successful,
      packet_loss: ((count - successful) / count.to_f * 100).round(2),
      avg_duration: durations.empty? ? nil : (durations.sum / durations.size).round(2),
      min_duration: durations.min,
      max_duration: durations.max
    }
  end
  
  def tcp_ping(host, port, timeout: 5)
    pinger = Net::Ping::TCP.new(host, port, timeout)
    result = pinger.ping
    
    {
      reachable: result,
      port: port,
      duration: result ? (pinger.duration * 1000).round(2) : nil
    }
  end
end

monitor = PingMonitor.new
monitor.icmp_ping('8.8.8.8')
# => { packets_sent: 4, packets_received: 4, packet_loss: 0.0, avg_duration: 12.5, ... }

SNMP Monitoring: The snmp gem implements SNMPv1 and SNMPv2c for querying network device metrics. It does not support SNMPv3.

require 'snmp'

class SNMPMonitor
  OID_SYSTEM_UPTIME = '1.3.6.1.2.1.1.3.0'
  OID_SYSTEM_DESCR = '1.3.6.1.2.1.1.1.0'
  OID_IF_TABLE = '1.3.6.1.2.1.2.2.1'
  
  def initialize(host, community: 'public', version: :SNMPv2c)
    @host = host
    @community = community
    @version = version
  end
  
  def get_system_info
    SNMP::Manager.open(host: @host, community: @community, version: @version) do |manager|
      response = manager.get([OID_SYSTEM_UPTIME, OID_SYSTEM_DESCR])
      
      {
        uptime_ticks: response.varbind_list[0].value.to_i,
        uptime_seconds: response.varbind_list[0].value.to_i / 100,
        description: response.varbind_list[1].value.to_s
      }
    end
  end
  
  def get_interface_stats(interface_index)
    oids = {
      in_octets: "#{OID_IF_TABLE}.10.#{interface_index}",
      out_octets: "#{OID_IF_TABLE}.16.#{interface_index}",
      in_errors: "#{OID_IF_TABLE}.14.#{interface_index}",
      out_errors: "#{OID_IF_TABLE}.20.#{interface_index}",
      status: "#{OID_IF_TABLE}.8.#{interface_index}"
    }
    
    SNMP::Manager.open(host: @host, community: @community, version: @version) do |manager|
      response = manager.get(oids.values)
      
      oids.keys.zip(response.varbind_list.map(&:value)).to_h
    end
  end
  
  def walk_interface_table
    interfaces = []
    
    SNMP::Manager.open(host: @host, community: @community, version: @version) do |manager|
      manager.walk(OID_IF_TABLE) do |row|
        interfaces << {
          oid: row.name.to_s,
          value: row.value
        }
      end
    end
    
    interfaces
  end
end
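
Interface octet values such as in_octets are cumulative counters, so a single reading says little on its own. The sketch below converts two successive readings into a per-second rate. The counter_rate helper is hypothetical, and it assumes a 32-bit counter that wrapped at most once between polls. A device reboot also resets counters and cannot be told apart from a wrap by this logic.

```ruby
COUNTER32_MAX = 2**32

# Converts two successive counter readings into a per-second rate.
# Assumes a 32-bit counter (such as ifInOctets) that wrapped at most once.
def counter_rate(previous, current, elapsed_seconds, max: COUNTER32_MAX)
  raise ArgumentError, 'elapsed_seconds must be positive' unless elapsed_seconds.positive?

  delta = current - previous
  delta += max if delta.negative? # counter wrapped past its maximum
  delta / elapsed_seconds.to_f
end

counter_rate(1_000, 61_000, 60)      # => 1000.0 bytes per second
counter_rate(4_294_967_000, 704, 10) # => 100.0 (counter wrapped between polls)
```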

Implementation Approaches

Network monitoring implementations vary based on monitoring scope, infrastructure scale, and performance requirements.

Centralized Polling: A central monitoring server periodically polls monitored targets to collect metrics. The monitoring server maintains a list of targets and check intervals, executing checks according to schedule. This approach provides simple deployment and configuration management, with all monitoring logic concentrated in a central location.

Centralized polling scales to hundreds of monitored endpoints before network bandwidth or monitoring server capacity becomes constrained. For larger deployments, multiple monitoring servers operate in parallel, each responsible for a subset of targets. Load balancing distributes checks across servers.
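
One simple way to split targets across several monitoring servers is a stable hash of the target name, sketched below. The assign_targets function and the server names are assumptions for this example. Modulo hashing reassigns many targets whenever the number of servers changes, so a production setup might prefer consistent hashing.

```ruby
require 'zlib'

# Assigns each target to one monitoring server using a stable hash, so every
# server can compute its own subset without coordinating with the others.
def assign_targets(targets, servers)
  raise ArgumentError, 'servers must not be empty' if servers.empty?

  targets.group_by { |target| servers[Zlib.crc32(target) % servers.size] }
end

servers = %w[mon-a mon-b mon-c] # hypothetical server names
targets = (1..9).map { |i| "host-#{i}.example.com" }
assignments = assign_targets(targets, servers)
assignments.each { |server, hosts| puts "#{server}: #{hosts.size} targets" }
```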

Advantages include simplified configuration management, consistent check execution, and centralized result storage. Disadvantages include network overhead from polling traffic, potential monitoring gaps during network partitions, and scaling limitations for very large deployments.

Distributed Agents: Monitoring agents run on each monitored system, collecting local metrics and forwarding results to a central aggregation service. Agents access local system resources directly, gathering detailed metrics unavailable through remote polling. Agent-based monitoring scales to thousands of endpoints with minimal network overhead.

The agent push model inverts the polling relationship. Rather than the monitoring server pulling metrics, agents push metrics at regular intervals. This approach traverses firewalls and NAT more reliably than polling. Agents cache metrics during network outages and transmit the buffered data when connectivity is restored.
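
The buffering behavior of a push agent can be sketched as follows. BufferingAgent and its transport interface are hypothetical: the transport is assumed to be any object that responds to call(batch) and raises when delivery fails, such as a wrapper around an HTTP POST.

```ruby
# Hypothetical push agent that keeps metrics in memory until delivery succeeds.
class BufferingAgent
  attr_reader :buffer

  def initialize(transport, max_buffer: 10_000)
    @transport = transport
    @max_buffer = max_buffer
    @buffer = []
  end

  def record(name, value, timestamp: Time.now.to_i)
    @buffer << { name: name, value: value, timestamp: timestamp }
    @buffer.shift while @buffer.size > @max_buffer # drop oldest when full
  end

  # Attempts to deliver everything buffered. Returns true on success and
  # leaves the buffer untouched when the transport raises.
  def flush
    return true if @buffer.empty?

    @transport.call(@buffer.dup)
    @buffer.clear
    true
  rescue StandardError
    false
  end
end
```

Capping the buffer bounds memory use during long outages at the cost of losing the oldest samples. Whether to drop oldest or newest data is a design choice that depends on which is more valuable after recovery.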

Agents require deployment and maintenance across all monitored systems. Configuration management systems like Chef, Puppet, or Ansible automate agent deployment. Containerized environments use sidecar containers for monitoring agents.

Hybrid Architecture: Large-scale monitoring combines centralized polling for external endpoints with distributed agents for internal systems. External website monitoring uses centralized probes from multiple geographic locations. Internal server monitoring uses local agents for detailed metrics. This approach optimizes network efficiency while maintaining comprehensive visibility.

Real-Time Stream Processing: High-frequency monitoring generates massive data volumes unsuitable for batch processing. Stream processing systems built on platforms like Apache Kafka consume monitoring metrics in real time, applying filtering, aggregation, and alerting rules before persisting results. With adequate capacity, this approach can handle millions of metrics per second while keeping alerting latency under a second.

Monitoring data flows through ingestion, processing, storage, and visualization stages. Ingestion normalizes data from heterogeneous sources into common formats. Processing applies aggregation, anomaly detection, and threshold evaluation. Storage optimizes metric retention and query performance. Visualization presents metrics through dashboards and graphs.

Time-Series Database Storage: Time-series databases optimize metric storage and retrieval. Specialized storage engines achieve high compression ratios through delta encoding and downsampling. Query engines support temporal aggregations and time-based filtering efficiently. Retention policies automatically downsample and expire old data.

Popular time-series databases include Prometheus, InfluxDB, and TimescaleDB. Selection depends on query patterns, data volume, and integration requirements. Prometheus excels at short-term storage and offers a powerful query language (PromQL). InfluxDB provides a flexible schema and clustering. TimescaleDB extends PostgreSQL with time-series optimizations.

Tools & Ecosystem

The Ruby ecosystem includes several monitoring tools and gems for implementing network monitoring solutions.

Prometheus Client: The prometheus-client gem exports application metrics in Prometheus format, enabling integration with Prometheus monitoring infrastructure.

require 'prometheus/client'
require 'prometheus/client/push'

# Initialize registry and metrics. Constants are used because top-level
# local variables are not visible inside method definitions.
prometheus = Prometheus::Client.registry

HTTP_REQUESTS = prometheus.counter(
  :http_requests_total,
  docstring: 'Total HTTP requests',
  labels: [:method, :path, :status]
)

REQUEST_DURATION = prometheus.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
)

# Instrument application code
def handle_request(method, path)
  start = Time.now
  
  # Application logic (perform_request is a placeholder for the real handler)
  status = perform_request(method, path)
  
  duration = Time.now - start
  HTTP_REQUESTS.increment(labels: { method: method, path: path, status: status })
  REQUEST_DURATION.observe(duration, labels: { method: method, path: path })
  
  status
end

# Push metrics to Prometheus Pushgateway
Prometheus::Client::Push.new(
  job: 'batch_job',
  gateway: 'http://pushgateway:9091'
).add(prometheus)

StatsD Client: The statsd-ruby gem sends metrics to StatsD servers for aggregation and forwarding to monitoring backends.

require 'statsd'

statsd = Statsd.new('localhost', 8125)

# Increment counters
statsd.increment('api.requests')
statsd.increment('api.errors.users.500')  # statsd-ruby has no tags: option; dimensions go in the metric name

# Record timing
statsd.timing('api.response_time', 250)

# Time block execution
statsd.time('database.query') do
  # Database operation
end

# Gauge for current values
statsd.gauge('active_connections', connection_pool.size)

# Set for unique values
statsd.set('unique_users', user_id)

HTTP Monitoring: The faraday gem provides middleware for HTTP request instrumentation and monitoring.

require 'faraday'

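# The statsd client passed in must support the tags: option, which is a
# DogStatsD extension (for example Datadog::Statsd from dogstatsd-ruby).
class MonitoringMiddleware < Faraday::Middleware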
  def initialize(app, statsd:)
    super(app)
    @statsd = statsd
  end
  
  def call(env)
    start = Time.now
    
    @app.call(env).on_complete do |response|
      duration = ((Time.now - start) * 1000).round(2)
      
      @statsd.timing('http.request.duration', duration)
      @statsd.increment('http.request.count', 
                       tags: ["status:#{response.status}", "host:#{env.url.host}"])
      
      if response.status >= 500
        @statsd.increment('http.request.errors', 
                         tags: ["status:#{response.status}"])
      end
    end
  end
end

# Configure Faraday with monitoring
conn = Faraday.new(url: 'https://api.example.com') do |f|
  f.use MonitoringMiddleware, statsd: statsd
  f.adapter Faraday.default_adapter
end

System Metrics: The sys-proctable and vmstat gems collect system-level metrics for host monitoring.

require 'sys/proctable'
require 'vmstat'

class SystemMonitor
  def collect_process_metrics
    processes = Sys::ProcTable.ps
    
    {
      total_processes: processes.count,
      total_threads: processes.sum { |p| p.nlwp.to_i },
      cpu_percentage: processes.sum { |p| p.pctcpu.to_f },
      memory_rss_mb: processes.sum { |p| p.rss.to_i } / 1024
    }
  end
  
  def collect_system_metrics
    memory = Vmstat.memory
    cpus = Vmstat.cpu  # one entry per core
    total_pages = memory.wired + memory.active + memory.inactive + memory.free
    
    {
      memory: {
        total_mb: memory.pagesize * total_pages / (1024 * 1024),
        active_mb: memory.pagesize * memory.active / (1024 * 1024),
        free_mb: memory.pagesize * memory.free / (1024 * 1024)
      },
      cpu: {
        # Tick counters summed across all cores
        user: cpus.sum(&:user),
        system: cpus.sum(&:system),
        idle: cpus.sum(&:idle)
      },
      load_average: Vmstat.load_average.one_minute
    }
  end
end

Application Performance Monitoring: Full-featured APM solutions like New Relic, DataDog, and Scout provide comprehensive monitoring with minimal instrumentation.

The monitoring gem ecosystem continues expanding with specialized tools for specific protocols and platforms. Selecting appropriate tools depends on monitoring requirements, existing infrastructure, and team expertise.

Performance Considerations

Network monitoring introduces overhead that affects both monitored systems and monitoring infrastructure. Understanding performance characteristics enables effective monitoring without degrading system performance.

Monitoring Overhead: Each monitoring check consumes CPU, memory, network bandwidth, and storage. Active HTTP checks generate network requests that increase server load. Passive packet capture processes every network packet, consuming CPU cycles. Balancing monitoring granularity against resource consumption requires careful tuning.

High-frequency polling generates significant network traffic. Monitoring 1,000 endpoints every second produces 1,000 requests per second, potentially saturating network links or overwhelming monitored systems. Adjusting check intervals based on metric volatility reduces unnecessary overhead. Critical metrics warrant frequent checks; stable metrics require infrequent validation.

Collection Efficiency: Efficient metric collection minimizes monitoring overhead. Batching multiple metrics into single requests reduces network round trips. Connection pooling eliminates repeated connection establishment overhead. Asynchronous collection prevents blocking on slow endpoints.

require 'concurrent'
require 'net/http'

class EfficientMonitor
  def initialize(concurrency: 10)
    @pool = Concurrent::FixedThreadPool.new(concurrency)
  end
  
  def check_endpoints_parallel(endpoints)
    futures = endpoints.map do |endpoint|
      Concurrent::Future.execute(executor: @pool) do
        check_endpoint(endpoint)
      end
    end
    
    futures.map(&:value)
  end
  
  def check_endpoint(url)
    uri = URI(url)
    start = Time.now
    
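    # Timeouts must be passed to start; setting them inside the block is too
    # late for open_timeout because the connection is already established.
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == 'https',
                    open_timeout: 5,
                    read_timeout: 5) do |http|
      # request_uri is never empty and keeps the query string
      response = http.get(uri.request_uri)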
      duration = ((Time.now - start) * 1000).round(2)
      
      { url: url, status: response.code.to_i, duration: duration }
    end
  rescue StandardError => e
    { url: url, error: e.message }
  end
end

monitor = EfficientMonitor.new(concurrency: 20)
results = monitor.check_endpoints_parallel([
  'https://api1.example.com/health',
  'https://api2.example.com/health',
  'https://api3.example.com/health'
])

Metric Aggregation: Raw metrics generate enormous data volumes. Aggregating metrics before storage reduces storage requirements and query costs. Computing statistics like average, minimum, maximum, and percentiles during collection eliminates the need for post-processing.

Time-series databases apply downsampling to reduce resolution of old data. Recent data maintains high resolution for detailed analysis. Older data aggregates into hourly or daily summaries. Retention policies automatically delete data exceeding configured age.

Query Performance: Monitoring dashboards execute numerous queries simultaneously. Inefficient queries degrade dashboard performance and increase database load. Proper indexing, query optimization, and caching improve query response times.

Pre-computing frequently queried aggregations reduces query complexity. Materialized views maintain pre-aggregated results updated incrementally. This trades increased storage for faster query response.

Alerting Latency: Alert delivery latency affects incident response time. Monitoring systems evaluate alert conditions continuously, typically every 30-60 seconds. Complex alert conditions increase evaluation time. Keeping alert logic simple maintains low latency.

Batching alerts reduces notification volume but increases delivery latency. Immediate notification ensures rapid response at the cost of potential alert fatigue from high-frequency notifications.
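
One middle ground between batching and immediate delivery is to send the first occurrence of an alert right away and suppress repeats for a cooldown period. The sketch below shows that suppression logic only. AlertThrottle is a hypothetical class, and the 300-second default is an arbitrary illustrative value.

```ruby
# Hypothetical throttle that suppresses repeats of the same alert key
# within a cooldown window. `now` is injectable to keep the logic testable.
class AlertThrottle
  def initialize(cooldown_seconds: 300)
    @cooldown = cooldown_seconds
    @last_sent = {}
  end

  # Returns true when the alert should be delivered, false when suppressed.
  def deliver?(key, now: Time.now)
    last = @last_sent[key]
    return false if last && (now - last) < @cooldown

    @last_sent[key] = now
    true
  end
end

throttle = AlertThrottle.new(cooldown_seconds: 300)
start = Time.now
throttle.deliver?('api.error_rate', now: start)       # => true
throttle.deliver?('api.error_rate', now: start + 100) # => false
throttle.deliver?('api.error_rate', now: start + 300) # => true
```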

Security Implications

Network monitoring accesses sensitive information and requires careful security consideration to prevent data exposure and system compromise.

Credential Management: Monitoring systems authenticate to monitored services using credentials. Storing credentials securely prevents unauthorized access. Secret management systems like HashiCorp Vault or AWS Secrets Manager encrypt credentials at rest and provide audit logging.

Avoid hardcoding credentials in monitoring scripts. Environment variables provide basic credential injection but can leak through process inspection (for example /proc/<pid>/environ on Linux), crash dumps, and inheritance by child processes. Dedicated secret management integrates with monitoring tools for secure credential retrieval.

require 'aws-sdk-secretsmanager'
require 'json'
require 'net/http'
require 'uri'

class SecureMonitor
  def initialize(region: 'us-east-1')
    @secrets_client = Aws::SecretsManager::Client.new(region: region)
  end
  
  def get_monitoring_credentials(secret_name)
    response = @secrets_client.get_secret_value(secret_id: secret_name)
    JSON.parse(response.secret_string)
  rescue Aws::SecretsManager::Errors::ResourceNotFoundException
    raise "Secret #{secret_name} not found"
  end
  
  def monitor_with_secure_credentials(url, secret_name)
    credentials = get_monitoring_credentials(secret_name)
    
    uri = URI(url)
    request = Net::HTTP::Get.new(uri)
    request.basic_auth(credentials['username'], credentials['password'])
    
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      response = http.request(request)
      { status: response.code.to_i, success: response.is_a?(Net::HTTPSuccess) }
    end
  end
end

Sensitive Data Exposure: Network monitoring captures potentially sensitive information including authentication tokens, API keys, and personal data in URLs or request bodies. Proper data sanitization prevents sensitive information from appearing in logs or metrics.

Implementing data masking redacts sensitive patterns before logging. Regular expressions identify common sensitive patterns like credit card numbers, Social Security numbers, and API keys. Masking replaces identified values with placeholder text.
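
A minimal masking sketch is shown below. The patterns and the mask_sensitive helper are illustrative assumptions, not a complete rule set: the card pattern, for instance, also matches any other run of 13 to 16 digits, and real API tokens vary widely in format.

```ruby
# Illustrative patterns only; real deployments need patterns matched to
# their own token and identifier formats.
SENSITIVE_PATTERNS = {
  /\b\d{3}-\d{2}-\d{4}\b/ => '[REDACTED-SSN]',
  /\b\d(?:[ -]?\d){12,15}\b/ => '[REDACTED-CARD]',
  /((?:api[_-]?key|token|password)=)[^&\s]+/i => '\1[REDACTED]'
}.freeze

# Applies every pattern in turn and returns the masked copy of the text.
def mask_sensitive(text)
  SENSITIVE_PATTERNS.reduce(text) do |masked, (pattern, replacement)|
    masked.gsub(pattern, replacement)
  end
end

mask_sensitive('GET /users?api_key=abc123&page=2')
# => "GET /users?api_key=[REDACTED]&page=2"
mask_sensitive('card 4111 1111 1111 1111 ok')
# => "card [REDACTED-CARD] ok"
```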

TLS Certificate Validation: Monitoring HTTPS endpoints requires proper TLS certificate validation. Disabling certificate verification creates man-in-the-middle vulnerabilities. Monitoring systems must validate certificates against trusted certificate authorities.

Custom certificate authorities require explicit trust configuration. Monitoring agents need access to CA certificate bundles for validation. Certificate expiration monitoring prevents outages from expired certificates.
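
Certificate expiration monitoring might look like the sketch below, which uses Ruby's standard openssl library. The function names and the 30-day and 7-day thresholds are assumptions for this example. Because fetch_certificate keeps verification enabled, an already expired or untrusted certificate raises OpenSSL::SSL::SSLError during the handshake, which a caller would normally treat as a critical result.

```ruby
require 'openssl'
require 'socket'

# Fetches the certificate presented by host:port with verification enabled.
def fetch_certificate(host, port = 443)
  context = OpenSSL::SSL::SSLContext.new
  context.set_params(verify_mode: OpenSSL::SSL::VERIFY_PEER)
  tcp = TCPSocket.new(host, port)
  ssl = OpenSSL::SSL::SSLSocket.new(tcp, context)
  ssl.hostname = host # send SNI so virtual hosts return the right certificate
  ssl.sync_close = true
  ssl.connect
  ssl.peer_cert
ensure
  ssl ? ssl.close : tcp&.close
end

# Whole days until the certificate's notAfter time (negative once expired).
def days_until_expiry(cert, now: Time.now)
  ((cert.not_after - now) / 86_400).floor
end

# Thresholds are illustrative defaults.
def certificate_status(cert, now: Time.now, warning_days: 30, critical_days: 7)
  days = days_until_expiry(cert, now: now)

  if days < critical_days
    :critical
  elsif days < warning_days
    :warning
  else
    :ok
  end
end

# Usage (requires network access):
#   cert = fetch_certificate('example.com')
#   certificate_status(cert)
```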

Access Control: Monitoring dashboards display sensitive operational information. Proper authentication and authorization prevent unauthorized access. Role-based access control limits data visibility based on user permissions.

Monitoring agents running on production systems require minimal privileges. Service accounts with limited permissions reduce blast radius from compromised agents. Network segmentation restricts monitoring traffic to dedicated monitoring networks.

Data Retention: Monitoring data retention policies balance compliance requirements against storage costs. Regulations like GDPR require data minimization and retention limits. Configuring appropriate retention periods prevents indefinite storage of sensitive operational data.

Time-series databases support automatic data expiration based on age. Setting retention policies ensures old data is deleted automatically. Critical security events may warrant longer retention than routine metrics.

Real-World Applications

Network monitoring operates across diverse production environments, each with specific monitoring requirements and challenges.

Microservices Architecture: Distributed microservices increase monitoring complexity. Each service requires health checks, performance monitoring, and dependency tracking. Service mesh technologies like Istio automatically instrument service-to-service communication, providing detailed latency and error rate metrics.

Distributed tracing tracks requests across service boundaries, identifying performance bottlenecks in complex call chains. Trace context propagation attaches trace IDs to requests, enabling correlation of logs and metrics across services.

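require 'faraday'
require 'securerandom'
require 'datadog/statsd'  # dogstatsd-ruby; the tags: option used below is a DogStatsD extension

class DistributedMonitor
  TRACE_HEADER = 'X-Trace-Id'
  
  def initialize(service_name)
    @service_name = service_name
    @statsd = Datadog::Statsd.new('localhost', 8125)
  end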
  
  def call_downstream_service(url, trace_id: nil)
    trace_id ||= SecureRandom.uuid
    start = Time.now
    
    conn = Faraday.new(url) do |f|
      f.adapter Faraday.default_adapter
    end
    
    response = conn.get do |req|
      req.headers[TRACE_HEADER] = trace_id
    end
    
    duration = ((Time.now - start) * 1000).round(2)
    
    @statsd.timing("#{@service_name}.downstream.duration", duration,
                   tags: ["trace_id:#{trace_id}", "status:#{response.status}"])
    
    { trace_id: trace_id, response: response }
  rescue StandardError => e
    @statsd.increment("#{@service_name}.downstream.errors",
                      tags: ["trace_id:#{trace_id}", "error:#{e.class}"])
    raise
  end
end

Database Monitoring: Database performance directly impacts application performance. Monitoring query latency, connection pool utilization, and slow query logs identifies performance issues. Connection pool monitoring prevents connection exhaustion that causes application timeouts.

Replication lag monitoring ensures data consistency in replicated databases. High replication lag indicates capacity problems or network issues between primary and replica nodes.

CDN and Edge Network Monitoring: Content delivery networks distribute content globally. Monitoring from multiple geographic locations validates CDN performance and availability. Edge network monitoring tracks cache hit rates, origin fetch times, and geographic distribution of requests.

Container Orchestration Monitoring: Kubernetes and container orchestration platforms introduce dynamic infrastructure. Containers start and stop frequently, requiring service discovery integration for monitoring. Container health checks determine container readiness and liveness.

Resource monitoring tracks CPU, memory, and network usage per container. Container orchestration platforms expose metrics through APIs that monitoring agents consume.

Third-Party API Monitoring: Applications depend on external APIs for functionality. Monitoring third-party API availability, latency, and error rates enables proactive issue detection. Synthetic monitoring from multiple locations validates global API availability.

Rate limit monitoring tracks API usage against quotas, preventing unexpected failures from exceeded limits. Error rate monitoring identifies degraded API performance requiring fallback strategies.

Reference

Monitoring Check Types

Check Type	Protocol	Use Case	Typical Interval
ICMP Ping	ICMP	Host reachability	30-60 seconds
TCP Port	TCP	Service availability	30-60 seconds
HTTP/HTTPS	HTTP/HTTPS	Web service health	60-300 seconds
DNS	DNS	Name resolution	300-600 seconds
SMTP	SMTP	Mail server availability	300-600 seconds
Database	Native protocol	Database connectivity	60-300 seconds
Certificate	TLS	Certificate expiration	3600-86400 seconds
API	HTTP/HTTPS	API endpoint functionality	60-300 seconds

Common SNMP OIDs

Metric	OID	Description
System Uptime	1.3.6.1.2.1.1.3.0	Time since system boot
System Description	1.3.6.1.2.1.1.1.0	System description
Interface Count	1.3.6.1.2.1.2.1.0	Number of network interfaces
Interface Status	1.3.6.1.2.1.2.2.1.8	Operational status per interface
Interface In Octets	1.3.6.1.2.1.2.2.1.10	Bytes received per interface
Interface Out Octets	1.3.6.1.2.1.2.2.1.16	Bytes transmitted per interface
Interface Errors In	1.3.6.1.2.1.2.2.1.14	Inbound errors per interface
Interface Errors Out	1.3.6.1.2.1.2.2.1.20	Outbound errors per interface
Load Average 1 Minute	1.3.6.1.4.1.2021.10.1.3.1	System load average over 1 minute (UCD-SNMP-MIB)
Memory Total	1.3.6.1.4.1.2021.4.5.0	Total system memory in kB (UCD-SNMP-MIB)
Memory Available	1.3.6.1.4.1.2021.4.6.0	Available system memory in kB (UCD-SNMP-MIB)

HTTP Status Code Categories

Range	Category	Monitoring Action
1xx	Informational	Log, no alert
2xx	Success	Normal operation
3xx	Redirection	Verify redirect target
4xx	Client Error	Alert on rate increase
5xx	Server Error	Alert immediately
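
The table above maps onto a simple classifier, sketched here. The monitoring_action function and the symbols it returns are hypothetical names chosen for this example.

```ruby
# Maps an HTTP status code to the monitoring action from the table above.
def monitoring_action(status_code)
  case status_code
  when 100..199 then :log
  when 200..299 then :ok
  when 300..399 then :verify_redirect
  when 400..499 then :track_error_rate
  when 500..599 then :alert
  else :unknown
  end
end

monitoring_action(204) # => :ok
monitoring_action(404) # => :track_error_rate
monitoring_action(503) # => :alert
```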

Metric Types

Type	Description	Example	Aggregation
Counter	Monotonically increasing value	Total requests	Rate, delta
Gauge	Point-in-time value	Current connections	Average, min, max
Histogram	Distribution of values	Request duration	Percentiles, average
Summary	Pre-computed percentiles	Response time p95	Percentile values
Set	Unique value count	Unique users	Cardinality

Alert Thresholds

Metric	Warning Threshold	Critical Threshold
HTTP Availability	< 99.5%	< 99.0%
Response Time (p95)	> 500ms	> 1000ms
Error Rate	> 1%	> 5%
Packet Loss	> 1%	> 5%
Certificate Expiration	< 30 days remaining	< 7 days remaining
Disk Space	> 80% used	> 90% used
Memory Usage	> 80%	> 90%
CPU Usage	> 70%	> 85%

Network Latency Targets

Connection Type	Expected Latency	Quality Classification
Same Region	1-5ms	Excellent
Continental	20-50ms	Good
Intercontinental	100-200ms	Acceptable
Satellite	500-700ms	High latency
Mobile 4G	30-50ms	Good
Mobile 3G	100-500ms	Acceptable

Ruby Monitoring Gems

Gem	Purpose	Use Case
net-ping	ICMP, TCP, UDP ping	Host reachability checks
snmp	SNMP client	Network device monitoring
prometheus-client	Metrics export	Prometheus integration
statsd-ruby	StatsD client	Metrics aggregation
dogstatsd-ruby	DataDog client	DataDog monitoring
newrelic_rpm	New Relic agent	APM monitoring
honeybadger	Error tracking	Exception monitoring
sentry-ruby	Error tracking	Error aggregation
sys-proctable	Process table	System metrics
vmstat	System statistics	Resource monitoring
