Overview
Scaling refers to the process of increasing system capacity to handle growing workloads. Two fundamental approaches exist: vertical scaling (scaling up) and horizontal scaling (scaling out). Vertical scaling increases the resources of a single machine—adding more CPU cores, RAM, or faster storage. Horizontal scaling adds more machines to distribute the workload across multiple nodes.
The distinction matters because each approach carries different implications for cost, complexity, reliability, and maximum capacity. A vertically scaled system runs on progressively larger machines, while a horizontally scaled system distributes work across many smaller machines. This architectural choice affects database design, session management, deployment strategies, and operational procedures.
Vertical scaling dominated early computing when distributed systems presented significant technical challenges. As cloud computing matured and distributed system patterns became standardized, horizontal scaling gained favor for its ability to scale beyond single-machine limits and provide fault tolerance through redundancy.
Most production systems use both approaches at different layers. A database might scale vertically to a certain point before implementing horizontal sharding, while application servers scale horizontally from the start. The choice depends on the specific component being scaled and the trade-offs acceptable for that layer.
Key Principles
Vertical scaling increases the capacity of individual machines by upgrading hardware components. This involves replacing existing CPUs with faster models, adding RAM modules, upgrading to NVMe storage, or moving to machines with higher specifications. The process requires downtime during hardware replacement, though cloud providers offer instance resizing with brief interruptions.
The vertical scaling approach maintains a single logical server from the application's perspective. Code runs on one machine with direct access to local memory and storage. Inter-process communication happens through local mechanisms like Unix sockets or shared memory. Database connections pool to a single instance, and sessions persist in local memory without coordination overhead.
Vertical scaling hits physical limits. Individual machines max out at hardware boundaries—the largest available CPUs, maximum addressable RAM, or fastest available storage. Cloud providers offer instances up to hundreds of cores and terabytes of memory, but these limits exist and costs increase non-linearly at the high end.
Horizontal scaling distributes workload across multiple machines. Each node runs a complete copy of the application, and load balancers distribute incoming requests. State management becomes distributed—sessions stored in Redis or databases, file uploads sent to object storage, and database queries routed to read replicas or sharded clusters.
Adding capacity horizontally involves provisioning new identical machines and registering them with load balancers. Most cloud platforms automate this through auto-scaling groups that monitor metrics and adjust capacity. The process happens without downtime since new nodes join while existing ones continue serving traffic.
Horizontal scaling introduces distributed systems complexity. Network calls replace local operations, creating latency and failure modes. Clock synchronization affects ordering guarantees. Distributed transactions require coordination protocols. Cache invalidation spreads across nodes. These challenges require architectural patterns like eventual consistency, idempotency, and circuit breakers.
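As one concrete pattern, a circuit breaker stops a service from hammering a failing downstream dependency. The sketch below is a minimal illustration, not any particular gem's API; the class name, threshold, and cooldown values are all arbitrary.

```ruby
# Minimal circuit-breaker sketch (illustrative; not a real gem's API).
# After `threshold` consecutive failures the breaker opens and rejects
# calls until `cooldown` seconds pass, then allows a trial call.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, cooldown: 30)
    @threshold = threshold
    @cooldown = cooldown
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open, refusing call" if open?
    result = yield
    @failures = 0            # a success resets the failure count
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end

  def open?
    return false unless @opened_at
    if Time.now - @opened_at > @cooldown
      @opened_at = nil       # half-open: permit one trial call
      @failures = 0
      false
    else
      true
    end
  end
end
```

A caller wraps each remote call in `breaker.call { ... }` and treats `OpenError` as an immediate, cheap failure rather than waiting on a timeout.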
The CAP theorem constrains horizontally scaled systems: during a network partition, a system cannot remain both consistent and available, so it must choose strong consistency (CP, rejecting some requests until the partition heals) or availability (AP, serving possibly stale data). Vertically scaled single-node systems sidestep partitions between replicas, but they sacrifice the redundancy that prevents single points of failure.
Stateless design enables horizontal scaling. Applications that store no local state can route requests to any node. Stateful components require session stickiness, data replication, or distributed state stores. Ruby applications typically store sessions in external systems, upload files to shared storage, and maintain database connections to clustered databases.
Design Considerations
Vertical scaling makes sense for databases requiring strong consistency. PostgreSQL and MySQL perform best on powerful single machines with fast storage and large RAM for caching. Database operations benefit from local disk access and in-memory operations without network overhead. Scaling to machines with 96 cores and 768GB RAM handles substantial workloads before requiring horizontal scaling through replication or sharding.
Applications with significant state benefit from vertical scaling when coordination costs outweigh hardware costs. Real-time gaming servers maintaining player positions and physics simulations run more efficiently on powerful machines than distributed across nodes requiring constant synchronization. The programming model remains simpler without distributed state management.
Legacy applications lacking stateless design often scale vertically more easily than refactoring for horizontal distribution. Migrating session state, file storage, and local caching to external systems requires architectural changes. Vertical scaling postpones this complexity, though it sets an eventual ceiling on capacity.
Cost considerations favor vertical scaling at small scales and horizontal scaling at large scales. Running one 16-core machine costs less than eight 2-core machines due to operational overhead and licensing. At hundreds or thousands of cores, commodity hardware deployed horizontally costs less than fewer massive machines. Cloud pricing reflects this with premiums on largest instances.
Horizontal scaling provides fault tolerance through redundancy. Multiple application servers mean individual failures don't cause outages. Load balancers detect unhealthy nodes and route traffic elsewhere. Database replicas take over when primaries fail. Vertical scaling creates single points of failure—when the machine fails, the service fails unless cold standby machines wait ready.
Development velocity considerations differ between approaches. Vertical scaling requires less code change—applications continue treating resources as local. Horizontal scaling demands architectural patterns for distributed systems, session management, and state coordination. Teams experienced with distributed systems implement horizontal scaling efficiently, while teams new to these patterns face steeper learning curves.
Performance characteristics vary by workload type. CPU-bound tasks benefit from faster processors available through vertical scaling. I/O-bound tasks benefit from parallelism across horizontally scaled nodes. Memory-intensive operations favor vertical scaling's unified memory space over distributed caching. Network-bound services gain from horizontal scaling's ability to distribute connection handling.
Deployment complexity increases with horizontal scaling. Single machines require straightforward deployment processes. Distributed systems need orchestration, health checks, rolling updates, and service discovery. Container orchestration platforms like Kubernetes manage this complexity but introduce operational overhead. Vertical scaling maintains simpler deployment at the cost of more disruptive updates.
Implementation Approaches
Cloud platform scaling uses provider-specific tools and services. AWS offers EC2 Auto Scaling Groups that maintain desired instance counts, scale based on CloudWatch metrics, and integrate with Elastic Load Balancing. Instances launch from AMIs containing configured application code. Target tracking policies adjust capacity to maintain metrics like CPU utilization at specified levels.
Container orchestration platforms abstract scaling across cloud providers or on-premises infrastructure. Kubernetes Horizontal Pod Autoscaler monitors metrics and adjusts replica counts. Deployments define desired state, and controllers maintain that state by creating or terminating pods. Services provide stable endpoints backed by changing pod sets. Cluster autoscalers add nodes when pods remain unschedulable.
Database scaling strategies separate read and write workloads. Primary databases handle writes and maintain authoritative state. Read replicas asynchronously copy data and serve read queries. Applications route writes to primaries and reads to replicas. Replica lag introduces eventual consistency—reads might not reflect recent writes. Connection pools distribute queries across replicas for load distribution.
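In a Rails application, this routing maps onto the framework's built-in multiple-database support (Rails 6+). The following is a wiring sketch rather than runnable standalone code, assuming `config/database.yml` defines `primary` and `primary_replica` entries (the names here are illustrative):

```ruby
# app/models/application_record.rb -- declare writer and reader roles
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route a block of read-only queries to the replica role explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  Report.where(status: "complete").count
end
```

Rails can also switch roles automatically per request, sending GETs to replicas after a configurable delay following a write; eventual consistency still applies, since a replica may lag the primary.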
Database sharding horizontally partitions data across multiple database instances. Each shard contains a subset of total data, determined by sharding key selection. Applications route queries to appropriate shards based on keys. Range-based sharding uses key ranges (users A-M, N-Z). Hash-based sharding applies hash functions to distribute data evenly. Cross-shard queries become complex or impossible.
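A minimal hash-based router might look like the following sketch, where `SHARDS` is a hypothetical stand-in for real shard connections. Note that plain modulo routing remaps most keys whenever the shard count changes, which is why production systems often use consistent hashing instead.

```ruby
require "zlib"

# Hash-based shard routing sketch: CRC32 of the sharding key, modulo the
# shard count, picks a shard. SHARDS stands in for real connections.
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"].freeze

def shard_for(key)
  SHARDS[Zlib.crc32(key.to_s) % SHARDS.size]
end
```

The same key always routes to the same shard, which is what makes single-key queries cheap and cross-shard queries expensive.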
Caching layers scale horizontally to reduce database load. Distributed caches like Redis Cluster or Memcached spread cached data across nodes. Cache keys hash to specific nodes for consistent routing. Applications check caches before database queries. Cache invalidation requires careful coordination to prevent stale data. Cache miss thundering herds need protection through request coalescing or probabilistic early expiration.
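One simple herd-protection technique is adding random jitter to TTLs so that entries written together do not all expire together. A sketch, with an arbitrary 10% jitter window:

```ruby
# Jittered-TTL sketch: spreading expirations over a window prevents many
# keys written at the same time from all expiring (and being recomputed)
# at the same instant. The 10% window is an arbitrary example.
def jittered_ttl(base_seconds, jitter_fraction: 0.1)
  jitter = (base_seconds * jitter_fraction * rand).round
  base_seconds + jitter
end
```

A one-hour TTL then lands somewhere in the 3600-3960 second range, smearing recomputation load over six minutes instead of one moment.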
Message queue scaling distributes asynchronous workloads. Queue services like RabbitMQ or Amazon SQS buffer requests for background processing. Worker processes scale horizontally to consume messages in parallel. Each worker polls for messages, processes them, and acknowledges completion. Failed messages retry or move to dead letter queues. Queue depth metrics drive autoscaling decisions.
Session management strategies enable stateless application scaling. External session stores like Redis maintain session data accessible from any application node. Encrypted session cookies store small amounts of session data client-side. Sticky sessions route user requests to consistent nodes but reduce load distribution effectiveness and complicate rolling updates.
Static asset scaling separates file serving from application logic. Content delivery networks cache and serve static files from edge locations near users. Object storage services like Amazon S3 provide durable storage for uploads. Applications generate signed URLs for direct client access to objects. CloudFront distributions cache objects and serve them from edge locations with low latency.
Ruby Implementation
Ruby web applications scale horizontally through application server processes. Puma, the default Rails server, runs multiple worker processes, each handling concurrent requests through threads. Configuration specifies worker count and thread pools:
```ruby
# config/puma.rb
workers ENV.fetch("WEB_CONCURRENCY") { 4 }

threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count

preload_app!

on_worker_boot do
  ActiveRecord::Base.establish_connection
end
```
Each Puma worker forks from a master process after code loading, sharing memory through copy-on-write. Thread pools within workers handle concurrent requests. The configuration scales vertically through more threads per worker and horizontally through more worker processes across machines.
Database connection pools manage concurrent query execution. ActiveRecord establishes connection pools sized by thread count. Each thread checks out connections for queries, returning them to the pool afterward:
```yaml
# config/database.yml
production:
  adapter: postgresql
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  timeout: 5000
```

The pool size can also ride along in the database URL:

```shell
DATABASE_URL=postgres://user:pass@host/db?pool=20
```
Connection pool exhaustion occurs when all connections are in use; requests then wait for a free connection until the pool timeout expires. Horizontally scaling application servers requires scaling database connections proportionally: a database that supports 100 connections shared across 10 application servers leaves at most 10 connections per server.
Session stores determine scaling architecture. Cookie-based sessions maintain state client-side, enabling fully stateless scaling:
```ruby
# config/initializers/session_store.rb
Rails.application.config.session_store :cookie_store,
  key: "_app_session",
  secure: Rails.env.production?,
  same_site: :lax
```
Redis session stores enable server-side session state accessible from any application node:
```ruby
# Gemfile
gem "redis-rails"
```

```ruby
# config/initializers/session_store.rb
Rails.application.config.session_store :redis_store,
  servers: ["redis://localhost:6379/0/session"],
  expire_after: 90.minutes,
  key: "_app_session"
```
Background job processing scales through worker processes consuming from shared queues. Sidekiq uses Redis-backed queues and runs concurrent jobs through threads:
```yaml
# config/sidekiq.yml
:concurrency: 10
:queues:
  - critical
  - default
  - low_priority
```

```ruby
# Worker class
class ReportGenerator
  include Sidekiq::Worker
  sidekiq_options queue: :default, retry: 3

  def perform(user_id)
    user = User.find(user_id)
    report = user.generate_report
    ReportMailer.send_report(report).deliver_now
  end
end
```
Scaling background processing horizontally involves running multiple Sidekiq processes across machines. Each process polls Redis queues for pending jobs. Job distribution happens automatically through Redis's blocking pop operations. Failed jobs retry according to configured strategies.
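The competing-consumers pattern behind this can be sketched with plain Ruby threads and a thread-safe `Queue` standing in for the Redis list:

```ruby
# Competing-consumers sketch: a thread-safe Queue stands in for a Redis
# list; each "worker" thread pops jobs until the queue drains. Sidekiq
# processes share work the same way via Redis's blocking pop.
jobs = Queue.new
20.times { |i| jobs << i }
processed = Queue.new

workers = 4.times.map do
  Thread.new do
    loop do
      job = begin
        jobs.pop(true)       # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      processed << job       # stand-in for performing and acknowledging the job
    end
  end
end
workers.each(&:join)
```

Each job is consumed by exactly one worker, so adding worker processes (or machines) divides the backlog without any explicit coordination.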
Cache scaling distributes cached data across Redis nodes. Redis Cluster shards data based on hash slots:
```ruby
# config/initializers/redis.rb
redis_config = {
  cluster: [
    "redis://node1:6379",
    "redis://node2:6379",
    "redis://node3:6379"
  ],
  timeout: 1,
  reconnect_attempts: 3
}
$redis = Redis.new(redis_config)
```

```ruby
# Caching in application code
def expensive_calculation(id)
  cache_key = "calculation:#{id}"
  $redis.get(cache_key) || begin
    result = perform_calculation(id)
    $redis.setex(cache_key, 3600, result)
    result
  end
end
```
File upload handling requires shared storage for horizontal scaling. Direct uploads to S3 remove upload processing from application servers:
```ruby
# app/models/document.rb
class Document < ApplicationRecord
  has_one_attached :file
end
```

```yaml
# config/storage.yml
amazon:
  service: S3
  access_key_id: <%= ENV["AWS_ACCESS_KEY_ID"] %>
  secret_access_key: <%= ENV["AWS_SECRET_ACCESS_KEY"] %>
  region: us-east-1
  bucket: app-uploads
```

```ruby
# config/environments/production.rb
config.active_storage.service = :amazon
```
ActiveStorage generates presigned URLs for direct client uploads, bypassing application servers entirely. This architecture scales horizontally without shared filesystem requirements.
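Conceptually, a presigned URL is just a URL whose path and expiry are signed with a server-held secret, so the storage service can verify the request without consulting the application. S3's real scheme (Signature Version 4) is considerably more involved; the sketch below only illustrates the idea, and `SECRET` and the parameter names are hypothetical.

```ruby
require "openssl"
require "cgi"

# Conceptual presigned-URL sketch: sign the path plus an expiry with an
# HMAC so the server can later verify the link without any lookup.
SECRET = "server-side-secret" # hypothetical; a real secret stays out of code

def presign(path, expires_at)
  payload = "#{path}:#{expires_at.to_i}"
  sig = OpenSSL::HMAC.hexdigest("SHA256", SECRET, payload)
  "#{path}?expires=#{expires_at.to_i}&signature=#{CGI.escape(sig)}"
end

def valid?(path, expires, signature)
  return false if Time.now.to_i > expires.to_i
  expected = OpenSSL::HMAC.hexdigest("SHA256", SECRET, "#{path}:#{expires.to_i}")
  expected == signature # production code would use a constant-time compare
end
```

Because the signature covers the path and expiry, tampering with either invalidates the link, and the link dies on its own after the deadline.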
Practical Examples
An e-commerce platform starts on a single server handling 1,000 daily orders. As traffic grows to 10,000 daily orders, response times degrade. The initial vertical scaling approach upgrades from a 2-core, 4GB instance to an 8-core, 32GB instance. Database queries speed up with more memory for caching. Application requests process faster with additional cores. This scaling handles growth for several months.
At 50,000 daily orders, the single server reaches limits. Peak traffic causes request queuing and timeouts. The team implements horizontal scaling by deploying three application servers behind an Nginx load balancer:
```nginx
# nginx.conf upstream configuration
upstream app_servers {
    server app1.internal:3000;
    server app2.internal:3000;
    server app3.internal:3000;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Sessions move from cookie storage to Redis for cross-server accessibility. File uploads transition to S3. The database remains vertically scaled on a powerful instance. This architecture handles 200,000 daily orders across horizontally scaled application servers.
A real-time analytics API processes data streams and serves query results. Initial vertical scaling on a 16-core machine handles 100 requests per second. Query processing involves complex aggregations over large datasets. Memory caching speeds frequent queries, and the powerful CPU processes aggregations quickly.
Growth to 500 requests per second requires horizontal scaling. The team shards analytics data by date ranges—recent data on hot shards, historical data on cold storage. Query routing logic directs requests to appropriate shards:
```ruby
class AnalyticsQuery
  def self.fetch(start_date, end_date)
    shards = determine_shards(start_date, end_date)
    # Fan out to each shard in parallel, then collect via Thread#value
    results = shards.map do |shard|
      Thread.new { shard.query(start_date, end_date) }
    end.map(&:value)
    merge_results(results)
  end

  # Note: a bare `private` does not affect methods defined with
  # `def self.`; private_class_method is needed instead.
  def self.determine_shards(start_date, end_date)
    date_range = start_date..end_date
    # Range#overlaps? comes from ActiveSupport
    SHARD_MAPPING.select { |range, _| range.overlaps?(date_range) }
                 .values
  end
  private_class_method :determine_shards
end
```
Read replicas distribute query load. Write operations go to the primary database, reads distribute across replicas. Replica lag monitoring ensures queries hit sufficiently up-to-date replicas for application requirements.
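That routing decision can be sketched as a small policy object. The lag numbers would come from monitoring (for PostgreSQL, `pg_stat_replication`), and the 5-second threshold is an arbitrary example:

```ruby
# Lag-aware read routing sketch: reads that tolerate staleness go to the
# least-lagged replica; reads that need fresh data, or any read when all
# replicas lag too far, fall back to the primary.
class ReadRouter
  def initialize(primary:, replicas:, max_lag_seconds: 5)
    @primary = primary
    @replicas = replicas          # { replica_name => lag_in_seconds }
    @max_lag = max_lag_seconds
  end

  def pick(require_fresh: false)
    return @primary if require_fresh
    name, lag = @replicas.min_by { |_, l| l }
    lag && lag <= @max_lag ? name : @primary
  end
end
```

Callers tag queries that must observe their own writes with `require_fresh: true`; everything else drains to replicas.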
A SaaS application processes background jobs for report generation. Initial vertical scaling uses one machine with 24 cores running Sidekiq with high concurrency. Each report generation takes 30 seconds of CPU time. The single machine processes 2,880 reports per hour at full utilization.
Customer growth requires processing 10,000 reports per hour. Horizontal scaling deploys 10 worker machines, each running Sidekiq with 10 concurrent threads. The 100 total worker threads process reports in parallel, achieving the required throughput:
```yaml
# kubernetes deployment for workers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sidekiq-workers
spec:
  replicas: 10
  selector:            # required: must match the pod template labels
    matchLabels:
      app: sidekiq-workers
  template:
    metadata:
      labels:
        app: sidekiq-workers
    spec:
      containers:
        - name: sidekiq
          image: app:latest
          command: ["bundle", "exec", "sidekiq"]
          env:
            - name: REDIS_URL
              value: "redis://redis-cluster:6379"
            - name: RAILS_MAX_THREADS
              value: "10"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
```
Auto-scaling policies monitor queue depth and adjust worker count dynamically. During off-peak hours, worker count drops to 3. During peak processing, workers scale to 15. This approach optimizes cost while meeting processing requirements.
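The scaling decision itself reduces to a small function: divide queue depth by what one worker drains per interval, then clamp to the configured band. The constants below mirror the 3-to-15 worker band described above but are otherwise illustrative:

```ruby
# Queue-depth scaling sketch: desired worker count is the queue depth
# divided by the jobs one worker drains per scaling interval, clamped to
# a min/max band. The defaults are illustrative, not from a real policy.
def desired_workers(queue_depth, jobs_per_worker: 100, min: 3, max: 15)
  [[(queue_depth.to_f / jobs_per_worker).ceil, min].max, max].min
end
```

An autoscaler would evaluate this on each metrics tick and reconcile the worker deployment toward the returned count.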
Performance Considerations
Vertical scaling improves single-request performance through faster processors and more memory. Database queries benefit from CPU cache locality and high memory bandwidth. Application code executes faster on processors with higher clock speeds and better instruction-per-cycle performance. Large memory capacities cache entire working sets, eliminating disk I/O.
Horizontal scaling improves throughput through parallel request handling but doesn't accelerate individual requests. Each request still processes at single-node speed. Response time improvements come from reduced queuing when load distributes across nodes. The aggregate throughput grows linearly with node count for stateless workloads.
Network latency affects horizontally scaled systems. Remote cache hits add 1-2ms versus microseconds for local memory. Database queries across networks add roundtrip latency. Service mesh calls between microservices compound latency through multiple hops. Careful architecture minimizes remote calls in hot paths.
Connection pool sizing determines throughput limits. Application threads require database connections for query execution. Too few connections create bottlenecks; too many overwhelm databases. The formula connections = threads × workers × instances must stay below database limits:
```ruby
# Calculate total connections
app_servers  = 5
puma_workers = 4
puma_threads = 5

total_connections = app_servers * puma_workers * puma_threads
# => 100 connections needed
# Database must support 100+ connections

# Add buffer for background jobs and migrations
required_capacity = (total_connections * 1.2).round
# => 120 connections recommended
```
Cache hit rates determine scaling effectiveness. High cache hit rates reduce database load, allowing more traffic per database instance. Low hit rates cause cache overhead without benefit. Cache warming strategies preload frequently accessed data. Cache eviction policies like LRU balance hit rates against memory constraints.
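The arithmetic behind this is direct: only cache misses reach the database, so a 95% hit rate lets a database sized for roughly 500 queries per second front 10,000 requests per second.

```ruby
# Effective database load under a cache: only misses hit the database.
def db_queries_per_second(total_qps, hit_rate)
  total_qps * (1.0 - hit_rate)
end
```

The same formula shows why hit rate degradation is dangerous: dropping from 95% to 90% doubles database load, not a 5% increase.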
Query optimization reduces database load more effectively than scaling. Proper indexing, query planning, and schema design prevent scaling requirements. An unoptimized query forcing horizontal scaling might run efficiently on a vertically scaled database with correct indexes. Profile queries before scaling.
Serialization overhead affects distributed cache performance. Ruby's Marshal.dump and Marshal.load consume CPU cycles. JSON serialization offers cross-language compatibility at performance cost. MessagePack provides faster serialization:
```ruby
require "json"
require "msgpack" # provided by the msgpack gem

# Compare serialization approaches (User is an ActiveRecord model here)
data = { users: User.limit(100).map(&:attributes) }

# Marshal (Ruby-specific, fast)
marshaled = Marshal.dump(data)
Marshal.load(marshaled)

# JSON (portable, slower)
json = JSON.generate(data)
JSON.parse(json)

# MessagePack (portable, fast)
packed = MessagePack.pack(data)
MessagePack.unpack(packed)
```
Load balancer algorithms affect request distribution. Round-robin distributes requests evenly but ignores node capacity differences. Least connections route to nodes handling fewer requests, adapting to varying request processing times. IP hash provides session affinity without sticky sessions but creates uneven distribution.
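Least-connections selection is simple to state in code: pick the node with the fewest in-flight requests, which is what nginx's `least_conn` directive does. A sketch, with node names and counts as illustrative inputs:

```ruby
# Least-connections selection sketch: given a map of node => in-flight
# request count, route the next request to the least-loaded node.
def least_connections(nodes)
  nodes.min_by { |_, in_flight| in_flight }.first
end
```

Unlike round-robin, this adapts automatically when some requests take far longer than others, since slow nodes accumulate in-flight requests and stop receiving new ones.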
Auto-scaling lag introduces temporary capacity gaps. Metrics collection, threshold evaluation, and instance provisioning take minutes. During traffic spikes, existing instances handle excess load before new capacity comes online. Pre-scaling based on predictable patterns prevents this lag. Scheduled scaling adds capacity before expected load increases.
Real-World Applications
Web application deployments typically horizontally scale application tiers while vertically scaling databases. Ruby on Rails applications run across multiple instances in AWS Auto Scaling Groups behind Application Load Balancers. Each instance runs Puma with 4 workers and 5 threads, sized for 2-core instances. Traffic increases trigger new instance launches within 2-3 minutes.
Database primary instances scale vertically up to db.r5.24xlarge with 96 cores and 768GB RAM. Read replicas horizontally scale to distribute read traffic. Applications route write operations to primaries through writer endpoints, and read operations to reader endpoints backed by replica pools. Replica lag monitoring redirects reads requiring fresh data to primaries.
Background processing systems scale worker pools based on queue depth. Sidekiq workers run in separate deployment groups from web servers. Queue depth metrics trigger scaling—when queues exceed 1,000 jobs, additional workers launch. When queues drain below 100 jobs, workers terminate. This dynamic scaling optimizes processing costs.
Caching layers use Redis Cluster for horizontal scaling. Cache data shards across nodes based on key hashing. Applications connect to cluster endpoints, and Redis automatically routes requests to correct nodes. Cluster topology changes transparently to applications. Memory usage metrics trigger node additions to the cluster.
API gateways aggregate backend microservices. Each service scales independently based on its traffic patterns. User services might scale to 20 instances while payment services run 5 instances. API Gateway routes requests to appropriate service pools. Circuit breakers prevent cascading failures when services experience issues.
Monitoring systems track per-instance and aggregate metrics. Instance metrics identify individual node issues. Aggregate metrics show overall system capacity. CloudWatch dashboards display request rates, error rates, response times, and resource utilization. Alarms trigger when metrics exceed thresholds, indicating scaling needs.
Database connection management requires careful planning. Connection pooling configuration must account for total application instances. A database supporting 500 connections with 100 application instances means 5 connections per instance. PgBouncer provides connection pooling layers that reduce database connection requirements:
```ini
# pgbouncer.ini configuration
[databases]
production = host=primary-db.internal dbname=app

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
```

```shell
# Application connects to PgBouncer
DATABASE_URL=postgres://user:pass@pgbouncer:6432/production
```
Blue-green deployments minimize downtime during scaling changes. New application versions deploy to separate infrastructure ("green") while current versions run on existing infrastructure ("blue"). Traffic switches to green after health checks pass. This approach enables architecture changes, including scaling strategy modifications, without service interruptions.
Cost optimization balances performance and expense. Horizontally scaled systems use smaller instances with lower per-hour costs but higher total costs for equivalent capacity due to overhead. Vertically scaled systems use fewer, larger instances with premium pricing but lower management overhead. Spot instances reduce horizontal scaling costs for fault-tolerant workloads.
Reference
Scaling Approach Comparison
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Method | Increase machine resources | Add more machines |
| Maximum capacity | Hardware limits | Effectively unlimited |
| Fault tolerance | Single point of failure | Redundancy through multiple nodes |
| Complexity | Simpler architecture | Requires distributed system patterns |
| Cost at small scale | Lower | Higher due to overhead |
| Cost at large scale | Non-linear increase | Linear scaling |
| Downtime during scaling | Brief interruption | Zero downtime |
| State management | Local state supported | Requires distributed state |
| Development effort | Minimal code changes | Significant architectural changes |
| Performance per request | Higher single-request speed | Unchanged per request |
| Aggregate throughput | Limited by single machine | Scales with node count |
Ruby Application Server Scaling
| Configuration | Vertical Approach | Horizontal Approach |
|---|---|---|
| Puma workers | Increase count to match cores | Fixed count per instance |
| Thread pools | Increase for I/O bound | Balance threads and workers |
| Database connections | Scale pool with threads | Distribute across instances |
| Session storage | Memory sessions acceptable | Require external store |
| File uploads | Local filesystem works | Need shared storage |
| Background jobs | Increase concurrency | Add worker instances |
| Caching | Local memory cache | Distributed cache cluster |
| Deployment | Single instance update | Rolling updates across instances |
Database Scaling Strategies
| Strategy | Type | Use Case | Complexity |
|---|---|---|---|
| Larger instance | Vertical | Single database workload | Low |
| Read replicas | Horizontal | Read-heavy workloads | Medium |
| Connection pooling | Optimization | High connection counts | Low |
| Sharding | Horizontal | Very large datasets | High |
| Partitioning | Hybrid | Time-series data | Medium |
| Caching layer | Horizontal | Repeated queries | Medium |
| Multi-region | Horizontal | Global distribution | High |
Scaling Decision Matrix
| Workload Characteristic | Recommended Approach | Reason |
|---|---|---|
| CPU intensive | Vertical | Faster processors improve performance |
| I/O bound | Horizontal | Parallel I/O operations increase throughput |
| Memory intensive | Vertical | Unified memory space simplifies access |
| Network bound | Horizontal | Distribute connection handling |
| Strong consistency required | Vertical | Avoid distributed coordination overhead |
| High availability required | Horizontal | Redundancy prevents single points of failure |
| Unpredictable traffic | Horizontal | Dynamic scaling responds to demand |
| Legacy application | Vertical | Avoid architectural changes |
| Stateful application | Vertical | Avoid distributed state complexity |
| Stateless application | Horizontal | Easy parallel scaling |
Common Scaling Metrics
| Metric | Description | Scaling Trigger |
|---|---|---|
| CPU utilization | Processor load percentage | Above 70% sustained |
| Memory utilization | RAM usage percentage | Above 80% sustained |
| Request queue depth | Pending requests waiting | Above 10 sustained |
| Response time | Request processing duration | Above target latency |
| Error rate | Failed request percentage | Above 1% |
| Database connections | Active connection count | Above 80% of pool |
| Cache hit rate | Cache successful lookups | Below 90% |
| Disk I/O wait | Time waiting for disk | Above 20% |
| Network throughput | Data transfer rate | Above 80% of capacity |
Ruby Scaling Configuration Examples
| Component | Configuration Parameter | Vertical Setting | Horizontal Setting |
|---|---|---|---|
| Puma workers | WEB_CONCURRENCY | 16 workers | 4 workers per instance |
| Puma threads | RAILS_MAX_THREADS | 10 threads | 5 threads per worker |
| Database pool | pool size | 160 connections | 20 per instance |
| Sidekiq concurrency | concurrency value | 50 threads | 10 per instance |
| Redis connections | connection pool | 200 connections | 25 per instance |
| Nginx workers | worker_processes | 16 workers | auto per instance |