Overview
Scaling refers to the process of increasing system capacity to handle growing workloads. Two fundamental approaches exist: vertical scaling (scaling up) and horizontal scaling (scaling out). Vertical scaling increases the resources of a single machine—adding more CPU cores, RAM, or faster storage. Horizontal scaling adds more machines to distribute the workload across multiple nodes.
The distinction matters because each approach carries different implications for cost, complexity, reliability, and maximum capacity. A vertically scaled system runs on progressively larger machines, while a horizontally scaled system distributes work across many smaller machines. This architectural choice affects database design, session management, deployment strategies, and operational procedures.
Vertical scaling dominated early computing when distributed systems presented significant technical challenges. As cloud computing matured and distributed system patterns became standardized, horizontal scaling gained favor for its ability to scale beyond single-machine limits and provide fault tolerance through redundancy.
Most production systems use both approaches at different layers. A database might scale vertically to a certain point before implementing horizontal sharding, while application servers scale horizontally from the start. The choice depends on the specific component being scaled and the trade-offs acceptable for that layer.
Key Principles
Vertical scaling increases the capacity of individual machines by upgrading hardware components. This involves replacing existing CPUs with faster models, adding RAM modules, upgrading to NVMe storage, or moving to machines with higher specifications. The process requires downtime during hardware replacement, though cloud providers offer instance resizing with brief interruptions.
The vertical scaling approach maintains a single logical server from the application's perspective. Code runs on one machine with direct access to local memory and storage. Inter-process communication happens through local mechanisms like Unix sockets or shared memory. Database connections pool to a single instance, and sessions persist in local memory without coordination overhead.
Vertical scaling hits physical limits. Individual machines max out at hardware boundaries—the largest available CPUs, maximum addressable RAM, or fastest available storage. Cloud providers offer instances up to hundreds of cores and terabytes of memory, but these limits exist and costs increase non-linearly at the high end.
Horizontal scaling distributes workload across multiple machines. Each node runs a complete copy of the application, and load balancers distribute incoming requests. State management becomes distributed—sessions stored in Redis or databases, file uploads sent to object storage, and database queries routed to read replicas or sharded clusters.
Adding capacity horizontally involves provisioning new identical machines and registering them with load balancers. Most cloud platforms automate this through auto-scaling groups that monitor metrics and adjust capacity. The process happens without downtime since new nodes join while existing ones continue serving traffic.
Horizontal scaling introduces distributed systems complexity. Network calls replace local operations, creating latency and failure modes. Clock synchronization affects ordering guarantees. Distributed transactions require coordination protocols. Cache invalidation spreads across nodes. These challenges require architectural patterns like eventual consistency, idempotency, and circuit breakers.
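As one concrete pattern, a circuit breaker stops a service from hammering a failing downstream dependency. The sketch below is a minimal illustration, not any particular gem's API; the class name, threshold, and cooldown values are all arbitrary.

```ruby
# Minimal circuit-breaker sketch (illustrative; not a real gem's API).
# After `threshold` consecutive failures the breaker opens and rejects
# calls until `cooldown` seconds pass, then allows a trial call.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 3, cooldown: 30)
    @threshold = threshold
    @cooldown = cooldown
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open, refusing call" if open?
    result = yield
    @failures = 0            # a success resets the failure count
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end

  def open?
    return false unless @opened_at
    if Time.now - @opened_at > @cooldown
      @opened_at = nil       # half-open: permit one trial call
      @failures = 0
      false
    else
      true
    end
  end
end
```

A caller wraps each remote call in `breaker.call { ... }` and treats `OpenError` as an immediate, cheap failure rather than waiting on a timeout.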
The CAP theorem constrains horizontally scaled systems: during a network partition, a system cannot remain both consistent and available, so it must choose strong consistency (CP, rejecting some requests until the partition heals) or availability (AP, serving possibly stale data). Vertically scaled single-node systems sidestep partitions between replicas, but they sacrifice the redundancy that prevents single points of failure.
Stateless design enables horizontal scaling. Applications that store no local state can route requests to any node. Stateful components require session stickiness, data replication, or distributed state stores. Ruby applications typically store sessions in external systems, upload files to shared storage, and maintain database connections to clustered databases.
Design Considerations
Vertical scaling makes sense for databases requiring strong consistency. PostgreSQL and MySQL perform best on powerful single machines with fast storage and large RAM for caching. Database operations benefit from local disk access and in-memory operations without network overhead. Scaling to machines with 96 cores and 768GB RAM handles substantial workloads before requiring horizontal scaling through replication or sharding.
Applications with significant state benefit from vertical scaling when coordination costs outweigh hardware costs. Real-time gaming servers maintaining player positions and physics simulations run more efficiently on powerful machines than distributed across nodes requiring constant synchronization. The programming model remains simpler without distributed state management.
Legacy applications lacking stateless design often scale vertically more easily than refactoring for horizontal distribution. Migrating session state, file storage, and local caching to external systems requires architectural changes. Vertical scaling postpones this complexity, though it sets an eventual ceiling on capacity.
Cost considerations favor vertical scaling at small scales and horizontal scaling at large scales. Running one 16-core machine costs less than eight 2-core machines due to operational overhead and licensing. At hundreds or thousands of cores, commodity hardware deployed horizontally costs less than fewer massive machines. Cloud pricing reflects this with premiums on largest instances.
Horizontal scaling provides fault tolerance through redundancy. Multiple application servers mean individual failures don't cause outages. Load balancers detect unhealthy nodes and route traffic elsewhere. Database replicas take over when primaries fail. Vertical scaling creates single points of failure—when the machine fails, the service fails unless cold standby machines wait ready.
Development velocity considerations differ between approaches. Vertical scaling requires less code change—applications continue treating resources as local. Horizontal scaling demands architectural patterns for distributed systems, session management, and state coordination. Teams experienced with distributed systems implement horizontal scaling efficiently, while teams new to these patterns face steeper learning curves.
Performance characteristics vary by workload type. CPU-bound tasks benefit from faster processors available through vertical scaling. I/O-bound tasks benefit from parallelism across horizontally scaled nodes. Memory-intensive operations favor vertical scaling's unified memory space over distributed caching. Network-bound services gain from horizontal scaling's ability to distribute connection handling.
Deployment complexity increases with horizontal scaling. Single machines require straightforward deployment processes. Distributed systems need orchestration, health checks, rolling updates, and service discovery. Container orchestration platforms like Kubernetes manage this complexity but introduce operational overhead. Vertical scaling maintains simpler deployment at the cost of more disruptive updates.
Implementation Approaches
Cloud platform scaling uses provider-specific tools and services. AWS offers EC2 Auto Scaling Groups that maintain desired instance counts, scale based on CloudWatch metrics, and integrate with Elastic Load Balancing. Instances launch from AMIs containing configured application code. Target tracking policies adjust capacity to maintain metrics like CPU utilization at specified levels.
Container orchestration platforms abstract scaling across cloud providers or on-premises infrastructure. Kubernetes Horizontal Pod Autoscaler monitors metrics and adjusts replica counts. Deployments define desired state, and controllers maintain that state by creating or terminating pods. Services provide stable endpoints backed by changing pod sets. Cluster autoscalers add nodes when pods remain unschedulable.
Database scaling strategies separate read and write workloads. Primary databases handle writes and maintain authoritative state. Read replicas asynchronously copy data and serve read queries. Applications route writes to primaries and reads to replicas. Replica lag introduces eventual consistency—reads might not reflect recent writes. Connection pools distribute queries across replicas for load distribution.
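In a Rails application, this routing maps onto the framework's built-in multiple-database support (Rails 6+). The following is a wiring sketch rather than runnable standalone code, assuming `config/database.yml` defines `primary` and `primary_replica` entries (the names here are illustrative):

```ruby
# app/models/application_record.rb -- declare writer and reader roles
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Route a block of read-only queries to the replica role explicitly:
ActiveRecord::Base.connected_to(role: :reading) do
  Report.where(status: "complete").count
end
```

Rails can also switch roles automatically per request, sending GETs to replicas after a configurable delay following a write; eventual consistency still applies, since a replica may lag the primary.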
Database sharding horizontally partitions data across multiple database instances. Each shard contains a subset of total data, determined by sharding key selection. Applications route queries to appropriate shards based on keys. Range-based sharding uses key ranges (users A-M, N-Z). Hash-based sharding applies hash functions to distribute data evenly. Cross-shard queries become complex or impossible.
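A minimal hash-based router might look like the following sketch, where `SHARDS` is a hypothetical stand-in for real shard connections. Note that plain modulo routing remaps most keys whenever the shard count changes, which is why production systems often use consistent hashing instead.

```ruby
require "zlib"

# Hash-based shard routing sketch: CRC32 of the sharding key, modulo the
# shard count, picks a shard. SHARDS stands in for real connections.
SHARDS = ["shard_0", "shard_1", "shard_2", "shard_3"].freeze

def shard_for(key)
  SHARDS[Zlib.crc32(key.to_s) % SHARDS.size]
end
```

The same key always routes to the same shard, which is what makes single-key queries cheap and cross-shard queries expensive.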
Caching layers scale horizontally to reduce database load. Distributed caches like Redis Cluster or Memcached spread cached data across nodes. Cache keys hash to specific nodes for consistent routing. Applications check caches before database queries. Cache invalidation requires careful coordination to prevent stale data. Cache miss thundering herds need protection through request coalescing or probabilistic early expiration.
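One simple herd-protection technique is adding random jitter to TTLs so that entries written together do not all expire together. A sketch, with an arbitrary 10% jitter window:

```ruby
# Jittered-TTL sketch: spreading expirations over a window prevents many
# keys written at the same time from all expiring (and being recomputed)
# at the same instant. The 10% window is an arbitrary example.
def jittered_ttl(base_seconds, jitter_fraction: 0.1)
  jitter = (base_seconds * jitter_fraction * rand).round
  base_seconds + jitter
end
```

A one-hour TTL then lands somewhere in the 3600-3960 second range, smearing recomputation load over six minutes instead of one moment.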
Message queue scaling distributes asynchronous workloads. Queue services like RabbitMQ or Amazon SQS buffer requests for background processing. Worker processes scale horizontally to consume messages in parallel. Each worker polls for messages, processes them, and acknowledges completion. Failed messages retry or move to dead letter queues. Queue depth metrics drive autoscaling decisions.
Session management strategies enable stateless application scaling. External session stores like Redis maintain session data accessible from any application node. Encrypted session cookies store small amounts of session data client-side. Sticky sessions route user requests to consistent nodes but reduce load distribution effectiveness and complicate rolling updates.
Static asset scaling separates file serving from application logic. Content delivery networks cache and serve static files from edge locations near users. Object storage services like Amazon S3 provide durable storage for uploads. Applications generate signed URLs for direct client access to objects. CloudFront distributions cache objects and serve them from edge locations with low latency.
Ruby Implementation
Ruby web applications scale horizontally through application server processes. Puma, the default Rails server, runs multiple worker processes, each handling concurrent requests through threads. Configuration specifies worker count and thread pools:
```ruby
# config/puma.rb
workers ENV.fetch("WEB_CONCURRENCY") { 4 }

threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count

preload_app!

on_worker_boot do
  ActiveRecord::Base.establish_connection
end
```
Each Puma worker forks from a master process after code loading, sharing memory through copy-on-write. Thread pools within workers handle concurrent requests. The configuration scales vertically through more threads per worker and horizontally through more worker processes across machines.
Database connection pools manage concurrent query execution. ActiveRecord establishes connection pools sized by thread count. Each thread checks out connections for queries, returning them to the pool afterward:
```yaml
# config/database.yml
production:
  adapter: postgresql
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  timeout: 5000
```

The pool size can also ride along in the database URL:

```shell
DATABASE_URL=postgres://user:pass@host/db?pool=20
```
Connection pool exhaustion occurs when all connections are in use; requests then wait for a free connection until the pool timeout expires. Horizontally scaling application servers requires scaling database connections proportionally: a database that supports 100 connections shared across 10 application servers leaves at most 10 connections per server.
Session stores determine scaling architecture. Cookie-based sessions maintain state client-side, enabling fully stateless scaling:
```ruby
# config/initializers/session_store.rb
Rails.application.config.session_store :cookie_store,
  key: "_app_session",
  secure: Rails.env.production?,
  same_site: :lax
```
Redis session stores enable server-side session state accessible from any application node:
```ruby
# Gemfile
gem "redis-rails"
```

```ruby
# config/initializers/session_store.rb
Rails.application.config.session_store :redis_store,
  servers: ["redis://localhost:6379/0/session"],
  expire_after: 90.minutes,
  key: "_app_session"
```
Background job processing scales through worker processes consuming from shared queues. Sidekiq uses Redis-backed queues and runs concurrent jobs through threads:
```yaml
# config/sidekiq.yml
:concurrency: 10
:queues:
  - critical
  - default
  - low_priority
```

```ruby
# Worker class
class ReportGenerator
  include Sidekiq::Worker
  sidekiq_options queue: :default, retry: 3

  def perform(user_id)
    user = User.find(user_id)
    report = user.generate_report
    ReportMailer.send_report(report).deliver_now
  end
end
```
Scaling background processing horizontally involves running multiple Sidekiq processes across machines. Each process polls Redis queues for pending jobs. Job distribution happens automatically through Redis's blocking pop operations. Failed jobs retry according to configured strategies.
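The competing-consumers pattern behind this can be sketched with plain Ruby threads and a thread-safe `Queue` standing in for the Redis list:

```ruby
# Competing-consumers sketch: a thread-safe Queue stands in for a Redis
# list; each "worker" thread pops jobs until the queue drains. Sidekiq
# processes share work the same way via Redis's blocking pop.
jobs = Queue.new
20.times { |i| jobs << i }
processed = Queue.new

workers = 4.times.map do
  Thread.new do
    loop do
      job = begin
        jobs.pop(true)       # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break
      end
      processed << job       # stand-in for performing and acknowledging the job
    end
  end
end
workers.each(&:join)
```

Each job is consumed by exactly one worker, so adding worker processes (or machines) divides the backlog without any explicit coordination.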
Cache scaling distributes cached data across Redis nodes. Redis Cluster shards data based on hash slots:
```ruby
# config/initializers/redis.rb
redis_config = {
  cluster: [
    "redis://node1:6379",
    "redis://node2:6379",
    "redis://node3:6379"
  ],
  timeout: 1,
  reconnect_attempts: 3
}
$redis = Redis.new(redis_config)
```

```ruby
# Caching in application code
def expensive_calculation(id)
  cache_key = "calculation:#{id}"
  $redis.get(cache_key) || begin
    result = perform_calculation(id)
    $redis.setex(cache_key, 3600, result)
    result
  end
end
```
File upload handling requires shared storage for horizontal scaling. Direct uploads to S3 remove upload processing from application servers:
```ruby
# app/models/document.rb
class Document < ApplicationRecord
  has_one_attached :file
end
```

```yaml
# config/storage.yml
amazon:
  service: S3
  access_key_id: <%= ENV["AWS_ACCESS_KEY_ID"] %>
  secret_access_key: <%= ENV["AWS_SECRET_ACCESS_KEY"] %>
  region: us-east-1
  bucket: app-uploads
```

```ruby
# config/environments/production.rb
config.active_storage.service = :amazon
```
ActiveStorage generates presigned URLs for direct client uploads, bypassing application servers entirely. This architecture scales horizontally without shared filesystem requirements.
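Conceptually, a presigned URL is just a URL whose path and expiry are signed with a server-held secret, so the storage service can verify the request without consulting the application. S3's real scheme (Signature Version 4) is considerably more involved; the sketch below only illustrates the idea, and `SECRET` and the parameter names are hypothetical.

```ruby
require "openssl"
require "cgi"

# Conceptual presigned-URL sketch: sign the path plus an expiry with an
# HMAC so the server can later verify the link without any lookup.
SECRET = "server-side-secret" # hypothetical; a real secret stays out of code

def presign(path, expires_at)
  payload = "#{path}:#{expires_at.to_i}"
  sig = OpenSSL::HMAC.hexdigest("SHA256", SECRET, payload)
  "#{path}?expires=#{expires_at.to_i}&signature=#{CGI.escape(sig)}"
end

def valid?(path, expires, signature)
  return false if Time.now.to_i > expires.to_i
  expected = OpenSSL::HMAC.hexdigest("SHA256", SECRET, "#{path}:#{expires.to_i}")
  expected == signature # production code would use a constant-time compare
end
```

Because the signature covers the path and expiry, tampering with either invalidates the link, and the link dies on its own after the deadline.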
Practical Examples
An e-commerce platform starts on a single server handling 1,000 daily orders. As traffic grows to 10,000 daily orders, response times degrade. The initial vertical scaling approach upgrades from a 2-core, 4GB instance to an 8-core, 32GB instance. Database queries speed up with more memory for caching. Application requests process faster with additional cores. This scaling handles growth for several months.
At 50,000 daily orders, the single server reaches limits. Peak traffic causes request queuing and timeouts. The team implements horizontal scaling by deploying three application servers behind an Nginx load balancer:
```nginx
# nginx.conf upstream configuration
upstream app_servers {
    server app1.internal:3000;
    server app2.internal:3000;
    server app3.internal:3000;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```
Sessions move from cookie storage to Redis for cross-server accessibility. File uploads transition to S3. The database remains vertically scaled on a powerful instance. This architecture handles 200,000 daily orders across horizontally scaled application servers.
A real-time analytics API processes data streams and serves query results. Initial vertical scaling on a 16-core machine handles 100 requests per second. Query processing involves complex aggregations over large datasets. Memory caching speeds frequent queries, and the powerful CPU processes aggregations quickly.
Growth to 500 requests per second requires horizontal scaling. The team shards analytics data by date ranges—recent data on hot shards, historical data on cold storage. Query routing logic directs requests to appropriate shards:
```ruby
class AnalyticsQuery
  def self.fetch(start_date, end_date)
    shards = determine_shards(start_date, end_date)
    # Fan out to each shard in parallel, then collect via Thread#value
    results = shards.map do |shard|
      Thread.new { shard.query(start_date, end_date) }
    end.map(&:value)
    merge_results(results)
  end

  # Note: a bare `private` does not affect methods defined with
  # `def self.`; private_class_method is needed instead.
  def self.determine_shards(start_date, end_date)
    date_range = start_date..end_date
    # Range#overlaps? comes from ActiveSupport
    SHARD_MAPPING.select { |range, _| range.overlaps?(date_range) }
                 .values
  end
  private_class_method :determine_shards
end
```
Read replicas distribute query load. Write operations go to the primary database, reads distribute across replicas. Replica lag monitoring ensures queries hit sufficiently up-to-date replicas for application requirements.
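That routing decision can be sketched as a small policy object. The lag numbers would come from monitoring (for PostgreSQL, `pg_stat_replication`), and the 5-second threshold is an arbitrary example:

```ruby
# Lag-aware read routing sketch: reads that tolerate staleness go to the
# least-lagged replica; reads that need fresh data, or any read when all
# replicas lag too far, fall back to the primary.
class ReadRouter
  def initialize(primary:, replicas:, max_lag_seconds: 5)
    @primary = primary
    @replicas = replicas          # { replica_name => lag_in_seconds }
    @max_lag = max_lag_seconds
  end

  def pick(require_fresh: false)
    return @primary if require_fresh
    name, lag = @replicas.min_by { |_, l| l }
    lag && lag <= @max_lag ? name : @primary
  end
end
```

Callers tag queries that must observe their own writes with `require_fresh: true`; everything else drains to replicas.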
A SaaS application processes background jobs for report generation. Initial vertical scaling uses one machine with 24 cores running Sidekiq with high concurrency. Each report generation takes 30 seconds of CPU time. The single machine processes 2,880 reports per hour at full utilization.
Customer growth requires processing 10,000 reports per hour. Horizontal scaling deploys 10 worker machines, each running Sidekiq with 10 concurrent threads. The 100 total worker threads process reports in parallel, achieving the required throughput:
```yaml
# kubernetes deployment for workers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sidekiq-workers
spec:
  replicas: 10
  selector:            # required: must match the pod template labels
    matchLabels:
      app: sidekiq-workers
  template:
    metadata:
      labels:
        app: sidekiq-workers
    spec:
      containers:
        - name: sidekiq
          image: app:latest
          command: ["bundle", "exec", "sidekiq"]
          env:
            - name: REDIS_URL
              value: "redis://redis-cluster:6379"
            - name: RAILS_MAX_THREADS
              value: "10"
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
```
Auto-scaling policies monitor queue depth and adjust worker count dynamically. During off-peak hours, worker count drops to 3. During peak processing, workers scale to 15. This approach optimizes cost while meeting processing requirements.
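The scaling decision itself reduces to a small function: divide queue depth by what one worker drains per interval, then clamp to the configured band. The constants below mirror the 3-to-15 worker band described above but are otherwise illustrative:

```ruby
# Queue-depth scaling sketch: desired worker count is the queue depth
# divided by the jobs one worker drains per scaling interval, clamped to
# a min/max band. The defaults are illustrative, not from a real policy.
def desired_workers(queue_depth, jobs_per_worker: 100, min: 3, max: 15)
  [[(queue_depth.to_f / jobs_per_worker).ceil, min].max, max].min
end
```

An autoscaler would evaluate this on each metrics tick and reconcile the worker deployment toward the returned count.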
Performance Considerations
Vertical scaling improves single-request performance through faster processors and more memory. Database queries benefit from CPU cache locality and high memory bandwidth. Application code executes faster on processors with higher clock speeds and better instruction-per-cycle performance. Large memory capacities cache entire working sets, eliminating disk I/O.
Horizontal scaling improves throughput through parallel request handling but doesn't accelerate individual requests. Each request still processes at single-node speed. Response time improvements come from reduced queuing when load distributes across nodes. The aggregate throughput grows linearly with node count for stateless workloads.
Network latency affects horizontally scaled systems. Remote cache hits add 1-2ms versus microseconds for local memory. Database queries across networks add roundtrip latency. Service mesh calls between microservices compound latency through multiple hops. Careful architecture minimizes remote calls in hot paths.
Connection pool sizing determines throughput limits. Application threads require database connections for query execution. Too few connections create bottlenecks; too many overwhelm databases. The formula connections = threads × workers × instances must stay below database limits:
```ruby
# Calculate total connections
app_servers  = 5
puma_workers = 4
puma_threads = 5

total_connections = app_servers * puma_workers * puma_threads
# => 100 connections needed
# Database must support 100+ connections

# Add buffer for background jobs and migrations
required_capacity = (total_connections * 1.2).round
# => 120 connections recommended
```
Cache hit rates determine scaling effectiveness. High cache hit rates reduce database load, allowing more traffic per database instance. Low hit rates cause cache overhead without benefit. Cache warming strategies preload frequently accessed data. Cache eviction policies like LRU balance hit rates against memory constraints.
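The arithmetic behind this is direct: only cache misses reach the database, so a 95% hit rate lets a database sized for roughly 500 queries per second front 10,000 requests per second.

```ruby
# Effective database load under a cache: only misses hit the database.
def db_queries_per_second(total_qps, hit_rate)
  total_qps * (1.0 - hit_rate)
end
```

The same formula shows why hit rate degradation is dangerous: dropping from 95% to 90% doubles database load, not a 5% increase.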
Query optimization reduces database load more effectively than scaling. Proper indexing, query planning, and schema design prevent scaling requirements. An unoptimized query forcing horizontal scaling might run efficiently on a vertically scaled database with correct indexes. Profile queries before scaling.
Serialization overhead affects distributed cache performance. Ruby's Marshal.dump and Marshal.load consume CPU cycles. JSON serialization offers cross-language compatibility at performance cost. MessagePack provides faster serialization:
```ruby
require "json"
require "msgpack" # provided by the msgpack gem

# Compare serialization approaches (User is an ActiveRecord model here)
data = { users: User.limit(100).map(&:attributes) }

# Marshal (Ruby-specific, fast)
marshaled = Marshal.dump(data)
Marshal.load(marshaled)

# JSON (portable, slower)
json = JSON.generate(data)
JSON.parse(json)

# MessagePack (portable, fast)
packed = MessagePack.pack(data)
MessagePack.unpack(packed)
```
Load balancer algorithms affect request distribution. Round-robin distributes requests evenly but ignores node capacity differences. Least connections route to nodes handling fewer requests, adapting to varying request processing times. IP hash provides session affinity without sticky sessions but creates uneven distribution.
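Least-connections selection is simple to state in code: pick the node with the fewest in-flight requests, which is what nginx's `least_conn` directive does. A sketch, with node names and counts as illustrative inputs:

```ruby
# Least-connections selection sketch: given a map of node => in-flight
# request count, route the next request to the least-loaded node.
def least_connections(nodes)
  nodes.min_by { |_, in_flight| in_flight }.first
end
```

Unlike round-robin, this adapts automatically when some requests take far longer than others, since slow nodes accumulate in-flight requests and stop receiving new ones.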
Auto-scaling lag introduces temporary capacity gaps. Metrics collection, threshold evaluation, and instance provisioning take minutes. During traffic spikes, existing instances handle excess load before new capacity comes online. Pre-scaling based on predictable patterns prevents this lag. Scheduled scaling adds capacity before expected load increases.
Real-World Applications
Web application deployments typically horizontally scale application tiers while vertically scaling databases. Ruby on Rails applications run across multiple instances in AWS Auto Scaling Groups behind Application Load Balancers. Each instance runs Puma with 4 workers and 5 threads, sized for 2-core instances. Traffic increases trigger new instance launches within 2-3 minutes.
Database primary instances scale vertically up to db.r5.24xlarge with 96 cores and 768GB RAM. Read replicas horizontally scale to distribute read traffic. Applications route write operations to primaries through writer endpoints, and read operations to reader endpoints backed by replica pools. Replica lag monitoring redirects reads requiring fresh data to primaries.
Background processing systems scale worker pools based on queue depth. Sidekiq workers run in separate deployment groups from web servers. Queue depth metrics trigger scaling—when queues exceed 1,000 jobs, additional workers launch. When queues drain below 100 jobs, workers terminate. This dynamic scaling optimizes processing costs.
Caching layers use Redis Cluster for horizontal scaling. Cache data shards across nodes based on key hashing. Applications connect to cluster endpoints, and Redis automatically routes requests to correct nodes. Cluster topology changes transparently to applications. Memory usage metrics trigger node additions to the cluster.
API gateways aggregate backend microservices. Each service scales independently based on its traffic patterns. User services might scale to 20 instances while payment services run 5 instances. API Gateway routes requests to appropriate service pools. Circuit breakers prevent cascading failures when services experience issues.
Monitoring systems track per-instance and aggregate metrics. Instance metrics identify individual node issues. Aggregate metrics show overall system capacity. CloudWatch dashboards display request rates, error rates, response times, and resource utilization. Alarms trigger when metrics exceed thresholds, indicating scaling needs.
Database connection management requires careful planning. Connection pooling configuration must account for total application instances. A database supporting 500 connections with 100 application instances means 5 connections per instance. PgBouncer provides connection pooling layers that reduce database connection requirements:
```ini
# pgbouncer.ini configuration
[databases]
production = host=primary-db.internal dbname=app

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
```

```shell
# Application connects to PgBouncer
DATABASE_URL=postgres://user:pass@pgbouncer:6432/production
```
Blue-green deployments minimize downtime during scaling changes. New application versions deploy to separate infrastructure ("green") while current versions run on existing infrastructure ("blue"). Traffic switches to green after health checks pass. This approach enables architecture changes, including scaling strategy modifications, without service interruptions.
Cost optimization balances performance and expense. Horizontally scaled systems use smaller instances with lower per-hour costs but higher total costs for equivalent capacity due to overhead. Vertically scaled systems use fewer, larger instances with premium pricing but lower management overhead. Spot instances reduce horizontal scaling costs for fault-tolerant workloads.
Reference
Scaling Approach Comparison
| Aspect | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| Method | Increase machine resources | Add more machines |
| Maximum capacity | Hardware limits | Effectively unlimited |
| Fault tolerance | Single point of failure | Redundancy through multiple nodes |
| Complexity | Simpler architecture | Requires distributed system patterns |
| Cost at small scale | Lower | Higher due to overhead |
| Cost at large scale | Non-linear increase | Linear scaling |
| Downtime during scaling | Brief interruption | Zero downtime |
| State management | Local state supported | Requires distributed state |
| Development effort | Minimal code changes | Significant architectural changes |
| Performance per request | Higher single-request speed | Unchanged per request |
| Aggregate throughput | Limited by single machine | Scales with node count |
Ruby Application Server Scaling
| Configuration | Vertical Approach | Horizontal Approach |
|---|---|---|
| Puma workers | Increase count to match cores | Fixed count per instance |
| Thread pools | Increase for I/O bound | Balance threads and workers |
| Database connections | Scale pool with threads | Distribute across instances |
| Session storage | Memory sessions acceptable | Require external store |
| File uploads | Local filesystem works | Need shared storage |
| Background jobs | Increase concurrency | Add worker instances |
| Caching | Local memory cache | Distributed cache cluster |
| Deployment | Single instance update | Rolling updates across instances |
Database Scaling Strategies
| Strategy | Type | Use Case | Complexity |
|---|---|---|---|
| Larger instance | Vertical | Single database workload | Low |
| Read replicas | Horizontal | Read-heavy workloads | Medium |
| Connection pooling | Optimization | High connection counts | Low |
| Sharding | Horizontal | Very large datasets | High |
| Partitioning | Hybrid | Time-series data | Medium |
| Caching layer | Horizontal | Repeated queries | Medium |
| Multi-region | Horizontal | Global distribution | High |
Scaling Decision Matrix
| Workload Characteristic | Recommended Approach | Reason |
|---|---|---|
| CPU intensive | Vertical | Faster processors improve performance |
| I/O bound | Horizontal | Parallel I/O operations increase throughput |
| Memory intensive | Vertical | Unified memory space simplifies access |
| Network bound | Horizontal | Distribute connection handling |
| Strong consistency required | Vertical | Avoid distributed coordination overhead |
| High availability required | Horizontal | Redundancy prevents single points of failure |
| Unpredictable traffic | Horizontal | Dynamic scaling responds to demand |
| Legacy application | Vertical | Avoid architectural changes |
| Stateful application | Vertical | Avoid distributed state complexity |
| Stateless application | Horizontal | Easy parallel scaling |
Common Scaling Metrics
| Metric | Description | Scaling Trigger |
|---|---|---|
| CPU utilization | Processor load percentage | Above 70% sustained |
| Memory utilization | RAM usage percentage | Above 80% sustained |
| Request queue depth | Pending requests waiting | Above 10 sustained |
| Response time | Request processing duration | Above target latency |
| Error rate | Failed request percentage | Above 1% |
| Database connections | Active connection count | Above 80% of pool |
| Cache hit rate | Cache successful lookups | Below 90% |
| Disk I/O wait | Time waiting for disk | Above 20% |
| Network throughput | Data transfer rate | Above 80% of capacity |
Ruby Scaling Configuration Examples
| Component | Configuration Parameter | Vertical Setting | Horizontal Setting |
|---|---|---|---|
| Puma workers | WEB_CONCURRENCY | 16 workers | 4 workers per instance |
| Puma threads | RAILS_MAX_THREADS | 10 threads | 5 threads per worker |
| Database pool | pool size | 160 connections | 20 per instance |
| Sidekiq concurrency | concurrency value | 50 threads | 10 per instance |
| Redis connections | connection pool | 200 connections | 25 per instance |
| Nginx workers | worker_processes | 16 workers | auto per instance |