CrackedRuby

Overview

Rolling updates represent a deployment strategy where application instances transition from an old version to a new version incrementally rather than all at once. The process replaces a subset of running instances with updated versions, validates their health, and continues the pattern until all instances run the new version. This approach maintains service availability throughout the deployment cycle.

The strategy emerged from the need to deploy changes to production systems without service interruption. Traditional deployment methods required taking down all instances, deploying new code, and restarting services—an approach that caused downtime and disrupted users. Rolling updates eliminate this downtime by keeping a portion of the system operational during deployment.

The mechanism operates through controlled instance replacement. In a system with multiple application instances, the deployment process stops a small number of instances, deploys new code to those instances, verifies they function correctly, and then proceeds to the next batch. Traffic routing ensures requests only reach healthy instances capable of serving them.

Rolling updates differ fundamentally from other deployment strategies. A recreate deployment terminates all instances before starting new ones, accepting downtime as a trade-off for simplicity. Blue-green deployments maintain two complete environments and switch traffic between them, requiring double the resources but enabling instant rollback. Canary deployments route a small percentage of traffic to new versions while monitoring for issues before proceeding, whereas rolling updates replace instances systematically regardless of traffic distribution.

# Conceptual flow of a rolling update
# Initial state: 4 instances running v1
instances = [
  { id: 1, version: 'v1', status: 'running' },
  { id: 2, version: 'v1', status: 'running' },
  { id: 3, version: 'v1', status: 'running' },
  { id: 4, version: 'v1', status: 'running' }
]

# Update 2 instances at a time
batch_size = 2
instances[0..1].each do |instance|
  instance[:status] = 'stopped'
  deploy_new_version(instance)
  instance[:version] = 'v2'
  wait_for_health_check(instance)
  instance[:status] = 'running'
end
# State: instances 1,2 on v2; instances 3,4 on v1

# Update remaining instances
instances[2..3].each do |instance|
  instance[:status] = 'stopped'
  deploy_new_version(instance)
  instance[:version] = 'v2'
  wait_for_health_check(instance)
  instance[:status] = 'running'
end
# Final state: all instances on v2

The strategy requires infrastructure that supports running multiple instances and dynamically routing traffic. Load balancers direct requests to healthy instances while unhealthy or updating instances receive no traffic. Health checks monitor instance status and determine readiness to serve traffic. Service discovery mechanisms track available instances and update routing tables accordingly.
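As a minimal sketch of that routing behavior — assuming a hypothetical `Instance` record with a health flag, not any particular load balancer's API — a pool can hand out only healthy instances round-robin:

```ruby
# Minimal load-balancer pool: only healthy instances receive traffic.
# Instance and its health flag are illustrative stand-ins.
Instance = Struct.new(:id, :version, :healthy) do
  def healthy?
    healthy
  end
end

class LoadBalancerPool
  def initialize(instances)
    @instances = instances
    @cursor = 0
  end

  # Round-robin over healthy instances only; nil if none are available
  def next_instance
    candidates = @instances.select(&:healthy?)
    return nil if candidates.empty?

    instance = candidates[@cursor % candidates.size]
    @cursor += 1
    instance
  end

  def mark_unhealthy(id)
    find(id).healthy = false
  end

  def mark_healthy(id)
    find(id).healthy = true
  end

  private

  def find(id)
    @instances.find { |i| i.id == id }
  end
end

pool = LoadBalancerPool.new([
  Instance.new(1, 'v1', true),
  Instance.new(2, 'v1', true)
])
pool.mark_unhealthy(1)
# All traffic now routes to instance 2 until instance 1 recovers
```

During a rolling update, `mark_unhealthy` corresponds to removing an instance from the pool before stopping it, and `mark_healthy` to re-adding it once its health check passes.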

Key Principles

Rolling updates operate on several core principles that govern successful deployments. Understanding these principles helps design robust deployment processes and avoid common failure modes.

Gradual Instance Replacement forms the foundation of rolling updates. The deployment process divides instances into batches and updates each batch sequentially. Batch size determines deployment speed and risk exposure. Smaller batches reduce risk by limiting the number of instances running untested code, but increase deployment duration. Larger batches accelerate deployment but expose more capacity to potential issues. The minimum number of instances must remain available to handle traffic load during updates.
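The batch-size trade-off can be made explicit in a few lines; the 25% target and the `min_available` floor here are illustrative parameters, not standard values:

```ruby
# Batch size bounded by a minimum-available constraint: never take more
# instances out of service than the traffic floor allows.
def batch_size(total_instances, min_available:, target_fraction: 0.25)
  by_fraction = (total_instances * target_fraction).ceil
  by_capacity = total_instances - min_available
  # The capacity constraint wins when it is stricter; always update
  # at least one instance so the deployment makes progress.
  [[by_fraction, by_capacity].min, 1].max
end

batch_size(12, min_available: 11)  # capacity constraint wins: 1 at a time
batch_size(12, min_available: 3)   # fraction target wins: 3 at a time
```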

Health Verification ensures new instances function correctly before proceeding. After updating an instance, the system must verify it can serve traffic successfully. Health checks perform this verification through various mechanisms. Readiness probes test whether an instance accepts connections and responds appropriately. Liveness probes detect when an instance has failed and needs replacement. Startup probes give instances time to initialize before declaring them unhealthy. The deployment process waits for positive health signals before updating additional instances.

# Health check implementation (uses the http gem)
require 'http'

class HealthChecker
  def initialize(instance_url, timeout: 30)
    @instance_url = instance_url
    @timeout = timeout
  end

  def ready?
    start_time = Time.now
    while Time.now - start_time < @timeout
      return true if check_readiness
      sleep 2
    end
    false
  end

  def check_readiness
    response = HTTP.get("#{@instance_url}/health")
    response.status == 200 && response.parse['status'] == 'ready'
  rescue StandardError
    false
  end
end

Traffic Management controls how requests route to instances during deployment. Load balancers maintain a pool of healthy instances and distribute traffic among them. When the deployment process updates an instance, it removes that instance from the load balancer pool, preventing new connections. Existing connections may drain gracefully or terminate immediately depending on configuration. After the instance updates and passes health checks, the load balancer adds it back to the pool.
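A sketch of graceful draining, with a hypothetical `DrainableInstance` standing in for a real server: new requests are refused immediately, and the deployment waits up to a deadline for in-flight work to finish:

```ruby
# Graceful connection draining: stop accepting new connections, then
# wait (up to a deadline) for in-flight requests to complete.
class DrainableInstance
  attr_reader :active_connections

  def initialize
    @accepting = true
    @active_connections = 0
  end

  def accepting?
    @accepting
  end

  def handle_request
    raise 'not accepting' unless @accepting

    @active_connections += 1
  end

  def finish_request
    @active_connections -= 1
  end

  # Returns true if the instance drained within the deadline
  def drain(timeout_seconds: 30, poll_interval: 0.01)
    @accepting = false   # load balancer sends no new traffic
    deadline = Time.now + timeout_seconds
    sleep poll_interval while @active_connections.positive? && Time.now < deadline
    @active_connections.zero?
  end
end
```

If `drain` returns false, the deployment must decide whether to terminate the remaining connections or abort the update of this instance.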

State Synchronization addresses the challenge of running multiple versions simultaneously. During a rolling update, old and new versions coexist, potentially with different data schemas, API contracts, or business logic. The application must handle this temporary inconsistency. Database migrations must support both old and new code. API changes must maintain backward compatibility or version endpoints. Session data must remain valid across versions.

Rollback Capability provides recovery from failed deployments. The system must detect deployment failures through health check failures, error rate increases, or performance degradation. Detection triggers a rollback process that reverts instances to the previous version. Automated rollback responds faster than manual intervention but requires careful configuration to avoid false positives. The deployment system must preserve the previous version to enable rollback.

# Rollback detection and execution
class RollbackMonitor
  def initialize(deployment, error_threshold: 0.05)
    @deployment = deployment
    @error_threshold = error_threshold
  end

  def monitor
    updated_instances = @deployment.updated_instances
    error_rate = calculate_error_rate(updated_instances)
    
    if error_rate > @error_threshold
      trigger_rollback
    end
  end

  def calculate_error_rate(instances)
    total_requests = instances.sum { |i| i.request_count }
    return 0.0 if total_requests.zero? # avoid division by zero before traffic arrives

    error_requests = instances.sum { |i| i.error_count }
    error_requests.to_f / total_requests
  end

  def trigger_rollback
    @deployment.halt
    @deployment.updated_instances.each do |instance|
      instance.deploy(@deployment.previous_version)
      instance.restart
    end
  end
end

Version Compatibility ensures old and new versions operate together correctly during the transition period. The deployment process creates a mixed-version environment where some instances run the old code and others run the new code. Both versions must handle requests correctly and interact with shared resources without conflicts. Forward compatibility allows old code to work with data created by new code. Backward compatibility allows new code to work with data created by old code.
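Both directions of compatibility come down to tolerant readers. A minimal sketch, assuming an illustrative user payload where the new version added an `email` field:

```ruby
require 'json'

# Tolerant reader: ignores fields it doesn't know (forward compatibility)
# and defaults fields the writer didn't send (backward compatibility).
# Field names are illustrative.
def parse_user(json)
  data = JSON.parse(json)
  {
    id:    data.fetch('id'),
    name:  data.fetch('name'),
    email: data.fetch('email', nil) # added by the new version; may be absent
  }
end

old_payload = '{"id":1,"name":"Alice"}'                         # written by v1
new_payload = '{"id":2,"name":"Bob","email":"b@example.com","avatar":"x.png"}'
```

Old code applying the same discipline to `new_payload` simply ignores `avatar`, so both versions can read each other's data during the transition.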

Capacity Management maintains sufficient resources during deployment. Updating instances reduces available capacity since those instances cannot serve traffic. The remaining instances must handle the full load. The batch size must leave enough instances running to meet performance requirements. The deployment process may pause if system metrics indicate resource constraints. Some deployments increase total capacity temporarily to maintain performance during the update.
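A capacity gate can express this check directly; the load and per-instance capacity figures below are illustrative:

```ruby
# Capacity gate: before taking a batch offline, verify the remaining
# instances can absorb the current load.
def safe_to_update_batch?(total:, batch:, current_load:, per_instance_capacity:)
  remaining_capacity = (total - batch) * per_instance_capacity
  remaining_capacity >= current_load
end

safe_to_update_batch?(total: 4, batch: 2, current_load: 150,
                      per_instance_capacity: 100) # 200 rps remain: safe
safe_to_update_batch?(total: 4, batch: 2, current_load: 250,
                      per_instance_capacity: 100) # 200 rps remain: pause
```

A deployment loop would call this before each batch and pause (or shrink the batch) when it returns false.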

Implementation Approaches

Rolling update implementations vary based on infrastructure, application architecture, and operational requirements. Different approaches offer distinct trade-offs in complexity, resource usage, and deployment safety.

Basic Rolling Update replaces instances sequentially in a single group. The process stops an instance, deploys new code, starts the instance, verifies health, and repeats until all instances run the new version. This approach requires minimal infrastructure but offers limited control over deployment speed and risk exposure.

# Basic rolling update implementation
class BasicRollingUpdate
  def initialize(instances, new_version)
    @instances = instances
    @new_version = new_version
    @batch_size = calculate_batch_size
  end

  def execute
    @instances.each_slice(@batch_size) do |batch|
      update_batch(batch)
      verify_health(batch)
    end
  end

  def calculate_batch_size
    # Update 25% of instances at a time, minimum 1
    [(@instances.size * 0.25).ceil, 1].max
  end

  def update_batch(batch)
    batch.each do |instance|
      load_balancer.remove(instance)
      instance.stop
      instance.deploy(@new_version)
      instance.start
    end
  end

  def verify_health(batch)
    batch.each do |instance|
      raise DeploymentError unless health_checker.ready?(instance)
      load_balancer.add(instance)
    end
  end
end

Phased Rollout divides instances into multiple groups with pauses between groups. The deployment updates the first group, pauses for observation, then proceeds to subsequent groups if no issues appear. This approach increases deployment duration but provides opportunities to detect problems before they affect all instances. The pause duration depends on how quickly problems manifest and monitoring capabilities.
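A sketch of that loop with the observation and abort hooks injected as callables (the 5% error threshold is an illustrative default):

```ruby
# Phased rollout: update one group, observe, abort if the error budget
# is exceeded, otherwise continue to the next group.
class PhasedRollout
  Halted = Class.new(StandardError)

  def initialize(groups, updater:, error_rate:, pause: ->(_) {}, threshold: 0.05)
    @groups = groups          # e.g. [[i1, i2], [i3, i4]]
    @updater = updater        # callable: updates one group
    @error_rate = error_rate  # callable: current error rate
    @pause = pause            # observation window between groups
    @threshold = threshold
  end

  def execute
    @groups.each_with_index do |group, index|
      @updater.call(group)
      @pause.call(index)
      rate = @error_rate.call
      raise Halted, "error rate #{rate} after group #{index}" if rate > @threshold
    end
  end
end
```

Because the dependencies are injected, the same control flow works whether "update" means an SSH restart, a container replacement, or a test double.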

Surge Deployment creates new instances with the updated version before removing old instances. The process adds new instances to the pool, waits for them to become healthy, and then removes old instances. This approach temporarily increases resource usage but maintains full capacity throughout deployment and provides a faster rollback path—simply remove the new instances and keep the old ones running.

# Surge deployment strategy
class SurgeDeployment
  def initialize(instances, new_version, surge_count: nil)
    @instances = instances
    @new_version = new_version
    # Create surge_count additional instances, default to 50% of current
    @surge_count = surge_count || (@instances.size * 0.5).ceil
  end

  def execute
    # Create new instances
    new_instances = create_surge_instances
    wait_for_health(new_instances)
    add_to_load_balancer(new_instances)
    
    # Remove old instances
    @instances.each_slice(calculate_batch_size) do |batch|
      remove_from_load_balancer(batch)
      batch.each(&:terminate)
    end
  end

  def create_surge_instances
    @surge_count.times.map do
      Instance.create(version: @new_version)
    end
  end

  def calculate_batch_size
    # Mirror the basic strategy: 25% of instances at a time, minimum 1
    [(@instances.size * 0.25).ceil, 1].max
  end
end

Canary Rolling Update combines canary deployment principles with rolling updates. The process first deploys to a small canary group, monitors metrics closely, and proceeds with a full rolling update only if the canary succeeds. This approach catches issues early while maintaining the gradual instance replacement of rolling updates.
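The two stages can be sketched as one function, with the update and canary-evaluation steps injected so the control flow stays visible:

```ruby
# Canary-then-rolling sketch: update a small canary slice first, then
# update the rest only if the canary passes. All hooks are injected.
def canary_rolling_update(instances, update:, canary_healthy:, canary_size: 1)
  canary = instances.first(canary_size)
  rest   = instances.drop(canary_size)

  canary.each { |i| update.call(i) }
  return :rolled_back unless canary_healthy.call

  rest.each { |i| update.call(i) }
  :completed
end
```

In a real system `canary_healthy` would compare error rates, latency, and saturation for the canary slice against the stable fleet over an observation window.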

Zone-Aware Rolling Update coordinates updates across availability zones or regions to maintain geographic redundancy. The deployment updates instances in one zone completely before moving to the next zone, ensuring at least one zone always runs healthy instances. This approach prevents correlated failures across zones and maintains disaster recovery capabilities during deployment.
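A minimal sketch of the zone ordering, assuming instances are grouped by zone in a hash and a `zone_healthy` check is supplied by the caller:

```ruby
# Zone-aware ordering: finish one zone entirely before starting the
# next, so a bad release never degrades two zones at once.
def zone_aware_update(instances_by_zone, update:, zone_healthy:)
  instances_by_zone.each do |zone, instances|
    instances.each { |i| update.call(i) }
    # Stop before touching the next zone if this one did not recover
    raise "halting: zone #{zone} unhealthy after update" unless zone_healthy.call(zone)
  end
end
```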

MaxUnavailable and MaxSurge Configuration provides fine-grained control over deployment behavior. MaxUnavailable specifies the maximum number or percentage of instances that can be unavailable during deployment. MaxSurge specifies the maximum number or percentage of instances that can exist above the desired count during deployment. These parameters control deployment speed, resource usage, and availability guarantees.

# Configurable rolling update with maxUnavailable and maxSurge
class ConfigurableRollingUpdate
  def initialize(instances, new_version, config)
    @instances = instances
    @new_version = new_version
    @max_unavailable = calculate_max_unavailable(config[:max_unavailable])
    @max_surge = calculate_max_surge(config[:max_surge])
  end

  def execute
    if @max_surge > 0
      surge_deployment
    else
      in_place_deployment
    end
  end

  def calculate_max_unavailable(value)
    case value
    when Integer then value
    when String then (@instances.size * parse_percentage(value)).floor
    else (@instances.size * 0.25).floor # default 25%
    end
  end

  def calculate_max_surge(value)
    case value
    when Integer then value
    when String then (@instances.size * parse_percentage(value)).ceil
    else 0 # default to in-place updates
    end
  end

  def parse_percentage(str)
    str.delete('%').to_f / 100
  end
end

Database-Aware Deployment coordinates application updates with database schema changes. The process follows a specific sequence: deploy backward-compatible schema changes, deploy new application code that works with both old and new schemas, complete the rolling update, and finally deploy cleanup migrations that remove old schema elements. This multi-phase approach prevents application errors during the mixed-version period.
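The sequence can be encoded as an explicit, gated pipeline; the phase names and the `executor` callable are illustrative stand-ins for real migration and deploy commands:

```ruby
# The four phases as an ordered pipeline; each step must succeed
# before the next runs.
PHASES = [
  :expand_schema,     # 1. backward-compatible schema change
  :deploy_dual_code,  # 2. code that works with both schemas
  :rolling_update,    # 3. replace instances batch by batch
  :contract_schema    # 4. cleanup migration removing old elements
].freeze

def run_deployment(executor)
  PHASES.each do |phase|
    raise "deployment halted: #{phase} failed" unless executor.call(phase)
  end
  :done
end
```

Making the order explicit prevents the most common mistake in this area: running the contracting migration while old code is still serving traffic.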

Design Considerations

Selecting rolling updates over alternative deployment strategies requires evaluating application characteristics, infrastructure capabilities, and operational requirements. The decision impacts deployment safety, complexity, resource usage, and recovery procedures.

Stateful vs Stateless Applications significantly affects rolling update suitability. Stateless applications handle rolling updates naturally since instances can start and stop without data loss. Each request contains all necessary information, and load balancers can route requests to any healthy instance. Stateful applications require additional consideration for session data, in-memory caches, and persistent connections. Session affinity (sticky sessions) may be necessary to route users consistently to the same instance during updates. Alternatively, externalizing session state to a shared data store eliminates instance-specific state.

Application Startup Time influences batch size and deployment duration. Applications with fast startup times (seconds) can use smaller batches and complete deployments quickly. Applications with slow startup times (minutes) benefit from larger batches to reduce total deployment time. Extremely slow startup times may make rolling updates impractical—the deployment could take hours, increasing the duration of the mixed-version state and delaying rollback if needed.
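A back-of-the-envelope duration estimate makes the trade-off concrete; the health-check settle time is an illustrative default:

```ruby
# Rough deployment-duration estimate: number of batches times
# (startup time + health-check settle time per batch).
def estimated_duration_seconds(instances:, batch_size:, startup_seconds:,
                               health_settle_seconds: 10)
  batches = (instances.to_f / batch_size).ceil
  batches * (startup_seconds + health_settle_seconds)
end

# 12 instances in batches of 3 with a 180 s startup:
# 4 batches * 190 s = 760 s, roughly 13 minutes
estimated_duration_seconds(instances: 12, batch_size: 3, startup_seconds: 180)
```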

Resource Availability determines whether surge deployment is feasible. Surge deployments require temporary excess capacity, potentially doubling resource usage during deployment. Organizations with constrained resources or high infrastructure costs may prefer in-place updates that modify existing instances. Cloud environments with elastic capacity can provision surge instances easily and deprovision them after deployment completes.

# Decision framework for deployment strategy selection
class DeploymentStrategySelector
  def initialize(application)
    @application = application
  end

  def recommend_strategy
    if @application.stateful? && !@application.externalized_state?
      return :blue_green # Avoid mixed versions with unmanaged state
    end

    if @application.startup_time > 300 # 5 minutes
      return :blue_green # Avoid long deployments
    end

    if @application.critical_availability? && @application.surge_capacity_available?
      return :surge_rolling # Maintain full capacity
    end

    if @application.resource_constrained?
      return :basic_rolling # Minimize resource usage
    end

    :canary_rolling # Default to safest progressive deployment
  end
end

Breaking Changes determine deployment complexity. Backward-compatible changes deploy safely with rolling updates since both versions function correctly. Breaking changes—modified APIs, incompatible data formats, changed business logic—require careful planning. The deployment may need multiple phases: deploy a backward-compatible intermediate version, migrate data, deploy the version with breaking changes, and finally remove compatibility shims. Alternatively, API versioning allows old and new versions to coexist indefinitely.

Rollback Requirements influence deployment strategy choice. Rolling updates provide gradual rollback by reversing the update process, but rollback takes as long as the original deployment. Applications requiring instant rollback may prefer blue-green deployments that switch traffic back to the old environment immediately. The rollback decision point matters too—automated rollback based on metrics enables fast recovery but requires accurate failure detection. Manual rollback provides human judgment but introduces delay.

Monitoring and Observability capabilities affect deployment safety. Rolling updates require monitoring to detect failures early in the process. Applications with comprehensive metrics, logging, and alerting can safely use aggressive deployment schedules. Applications with limited observability should use conservative deployment schedules with longer pauses between batches. The deployment process should integrate with monitoring systems to automatically detect anomalies.

Database Migration Strategy heavily influences rolling update design. Expanding changes (adding columns, tables, or indexes) deploy safely—old code ignores new schema elements. Contracting changes (removing columns or tables) require multi-phase deployments. Modifying changes (changing column types or constraints) often require intermediate steps. The Expand-Migrate-Contract pattern addresses these challenges: expand the schema to support both versions, migrate data gradually, deploy new application code, and finally contract the schema to remove old elements.

# Expand-Migrate-Contract database migration example
# Phase 1: Expand - add new column while keeping old column
class AddUserEmailVerification < ActiveRecord::Migration[7.0]
  def change
    add_column :users, :email_verified_at, :datetime
    # Keep existing email_verified boolean column
  end
end

# Phase 2: Deploy new code that writes to both columns
class User < ApplicationRecord
  before_save :sync_email_verification

  def sync_email_verification
    if email_verified_at_changed?
      self.email_verified = email_verified_at.present?
    elsif email_verified_changed?
      self.email_verified_at = email_verified ? Time.current : nil
    end
  end
end

# Phase 3: After rolling update completes, migrate data
class BackfillEmailVerifiedAt < ActiveRecord::Migration[7.0]
  def up
    User.where(email_verified: true).update_all(
      'email_verified_at = updated_at'
    )
  end
end

# Phase 4: Remove old column in next deployment
class RemoveEmailVerifiedBoolean < ActiveRecord::Migration[7.0]
  def change
    remove_column :users, :email_verified, :boolean
  end
end

Team Size and Expertise affects deployment complexity tolerance. Small teams may prefer simple deployment processes that require minimal specialized knowledge. Large teams with dedicated operations engineers can manage complex multi-stage deployments and sophisticated rollback procedures. The deployment process should match the team's ability to troubleshoot and recover from failures.

Ruby Implementation

Ruby applications deploy with rolling updates using various tools and techniques. The implementation depends on deployment target—whether bare metal servers, VMs, containers, or serverless platforms—and the orchestration system managing the deployment.

Capistrano Deployments represent traditional Ruby deployment automation. Capistrano connects to multiple servers via SSH, uploads new code, runs migrations, and restarts application servers. A rolling update with Capistrano requires custom tasks that update servers in batches rather than all at once.

# Capistrano rolling update configuration
# config/deploy.rb
set :application, 'myapp'
set :repo_url, 'git@github.com:example/myapp.git'
set :deploy_to, '/var/www/myapp'
set :rolling_batch_size, 2

namespace :deploy do
  task :rolling_restart do
    on roles(:web), in: :groups, limit: fetch(:rolling_batch_size) do |host|
      execute :sudo, :systemctl, :restart, 'myapp'

      # Wait for this host to pass its health check before Capistrano
      # proceeds to the next batch; `test` runs the command on the host
      attempts = 0
      until test('curl -sf http://localhost:3000/health')
        attempts += 1
        raise "Health check failed for #{host}" if attempts >= 30
        sleep 2
      end
    end
  end
end

after 'deploy:publishing', 'deploy:rolling_restart'

Kubernetes Deployments provide native rolling update support for containerized Ruby applications. Kubernetes Deployments manage ReplicaSets that create and manage Pods. The rolling update strategy configures how Kubernetes replaces old Pods with new ones.

# kubernetes/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ruby-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # At most 1 pod can be unavailable
      maxSurge: 1        # At most 1 extra pod during update
  selector:
    matchLabels:
      app: ruby-app
  template:
    metadata:
      labels:
        app: ruby-app
    spec:
      containers:
      - name: ruby-app
        image: myregistry/ruby-app:v2
        ports:
        - containerPort: 3000
        readinessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 15
          periodSeconds: 10

# Ruby application health endpoint for Kubernetes probes
# app/controllers/health_controller.rb
class HealthController < ApplicationController
  def show
    # Check database connectivity
    ActiveRecord::Base.connection.execute('SELECT 1')
    
    # Check Redis connectivity if used
    redis = Redis.new
    redis.ping
    
    # Check critical dependencies
    render json: { 
      status: 'ready',
      version: ENV['APP_VERSION'],
      timestamp: Time.current
    }, status: :ok
  rescue => e
    render json: { 
      status: 'unhealthy',
      error: e.message 
    }, status: :service_unavailable
  end
end

Zero-Downtime Puma Restarts enable rolling updates on a single server by restarting worker processes without dropping connections. Puma's phased restart mode starts new workers with updated code while old workers finish processing existing requests.

# config/puma.rb
workers ENV.fetch('WEB_CONCURRENCY', 4)
threads_count = ENV.fetch('RAILS_MAX_THREADS', 5)
threads threads_count, threads_count

# Phased restarts are incompatible with preload_app!: each worker must
# boot the application itself so it can pick up new code independently.
# prune_bundler detaches Puma from the deployed Gemfile so a phased
# restart loads the new bundle.
prune_bundler

on_worker_boot do
  ActiveRecord::Base.establish_connection
end

worker_timeout 30
worker_shutdown_timeout 30

# A phased restart is triggered externally by sending SIGUSR1 to the
# Puma master process; there is no config-file switch for it.

# Deployment script with phased restart
#!/bin/bash
# deploy.sh

set -e

echo "Pulling latest code..."
git pull origin main

echo "Installing dependencies..."
bundle install --deployment --without development test

echo "Running migrations..."
RAILS_ENV=production bundle exec rake db:migrate

echo "Precompiling assets..."
RAILS_ENV=production bundle exec rake assets:precompile

echo "Restarting Puma with phased restart..."
kill -SIGUSR1 $(cat tmp/pids/puma.pid)

echo "Deployment complete"

Database Migration Handling requires special attention during rolling updates. Migrations run before new code deploys, meaning old code must work with the new schema. Rails migration techniques support this requirement.

# Safe migration that works with rolling updates
class AddUserPreferences < ActiveRecord::Migration[7.0]
  disable_ddl_transaction! # required for algorithm: :concurrently

  def change
    # Add column with default value so old code doesn't fail
    add_column :users, :preferences, :jsonb, default: {}

    # Add index concurrently to avoid locking the table; GIN suits jsonb
    add_index :users, :preferences, using: :gin, algorithm: :concurrently
  end
end

# Unsafe migration - breaks during rolling update
class ChangeUserEmailToRequired < ActiveRecord::Migration[7.0]
  def change
    # Don't do this! Old code will fail validation
    change_column_null :users, :email, false
  end
end

# Safe approach - multi-phase deployment
# Phase 1: Add validation at application level
class User < ApplicationRecord
  validates :email, presence: true
end

# Phase 2: After rolling update completes, add database constraint
class AddUserEmailConstraint < ActiveRecord::Migration[7.0]
  def change
    # First backfill any null values
    User.where(email: nil).update_all(email: 'unknown@example.com')
    
    # Then add constraint
    change_column_null :users, :email, false
  end
end

Background Job Processing during rolling updates requires managing job compatibility across versions. Jobs enqueued by old code may execute on workers running new code, and vice versa. Sidekiq and other background job processors need coordination to handle this gracefully.

# Versioned job implementation for rolling updates
class UserNotificationJob
  include Sidekiq::Job
  
  def perform(user_id, notification_type, options = {})
    # Handle both old and new argument formats
    options = parse_options(options)
    
    user = User.find(user_id)
    notifier = NotificationService.new(user)
    notifier.send_notification(notification_type, options)
  end
  
  private
  
  def parse_options(options)
    # Old version passed options as separate arguments
    # New version passes options as hash
    case options
    when Hash
      options
    when String
      # Old format: parse legacy option string
      { message: options }
    else
      {}
    end
  end
end

# Deploy strategy:
# 1. Deploy backward-compatible job that handles both formats
# 2. Wait for all old jobs to complete
# 3. Deploy new code that enqueues jobs in new format
# 4. Remove compatibility code in subsequent deployment

Real-World Applications

Production deployments demonstrate rolling update patterns, challenges, and solutions across different application types and scales.

Web Application Deployment represents the most common rolling update scenario. A Rails application serving web traffic needs to deploy new features without downtime. The application runs on Kubernetes with four pods behind a load balancer.

# Production deployment sequence
# 1. Database migration (backward compatible)
class AddUserAvatarUrl < ActiveRecord::Migration[7.0]
  def change
    add_column :users, :avatar_url, :string
    add_index :users, :avatar_url
  end
end

# 2. Deploy new code that uses avatar_url but doesn't require it
class User < ApplicationRecord
  def avatar
    avatar_url || default_avatar_url
  end
  
  def default_avatar_url
    "https://cdn.example.com/avatars/default.png"
  end
end

# 3. Kubernetes applies rolling update automatically
# kubectl apply -f deployment.yaml
# kubectl rollout status deployment/rails-app

# 4. Monitor deployment progress
# kubectl get pods -l app=rails-app -w

The deployment encounters a problem: new pods fail health checks because of a configuration error. Because unready pods receive no traffic and maxUnavailable caps how many old pods can be replaced, the rollout stalls rather than spreading the failure to every pod.

# Monitoring reveals the issue
kubectl describe pod rails-app-7d9f4c8-xyz
# Events show: Readiness probe failed: HTTP 500

# Check logs
kubectl logs rails-app-7d9f4c8-xyz
# Error: Redis connection refused

# Issue: new version expects the Redis URL in a different format
# Fix: update the configuration; changing the pod template starts
# a fresh rollout automatically
kubectl set env deployment/rails-app REDIS_URL=redis://redis:6379

# If needed, roll back to the previous revision
kubectl rollout undo deployment/rails-app

API Service Deployment handles version compatibility challenges. An API service has external clients that expect specific response formats. The deployment introduces a new response format while maintaining backward compatibility.

# API versioning for backward compatibility during rolling updates
class Api::V1::UsersController < ApplicationController
  def show
    user = User.find(params[:id])
    
    render json: {
      id: user.id,
      name: user.name,
      email: user.email,
      # New field added in rolling update
      # Old clients ignore unknown fields
      avatar_url: user.avatar_url,
      # Deprecated field maintained for backward compatibility
      # Will remove in future deployment after all clients updated
      profile_image: user.avatar_url
    }
  end
end

# New API version served simultaneously
class Api::V2::UsersController < ApplicationController
  def show
    user = User.find(params[:id])
    
    render json: UserSerializer.new(user).as_json
  end
end

# Routing supports both versions during transition
Rails.application.routes.draw do
  namespace :api do
    namespace :v1 do
      resources :users
    end
    namespace :v2 do
      resources :users
    end
  end
end

Microservices Deployment coordinates updates across dependent services. Service A depends on Service B's API. Service B adds a new required field to its API. The deployment sequence must prevent breaking Service A.

# Service B - Initial state
class OrdersController < ApplicationController
  def create
    order = Order.create!(
      user_id: params[:user_id],
      items: params[:items]
    )
    render json: order
  end
end

# Phase 1: Service B adds optional field
class OrdersController < ApplicationController
  def create
    order = Order.create!(
      user_id: params[:user_id],
      items: params[:items],
      # New field optional during transition
      shipping_address: params[:shipping_address]
    )
    render json: order
  end
end

# Phase 2: Service A updated to send new field
class OrderService
  def create_order(user_id, items, shipping_address)
    HTTParty.post(
      "#{ENV['ORDERS_SERVICE_URL']}/orders",
      headers: { 'Content-Type' => 'application/json' },
      body: {
        user_id: user_id,
        items: items,
        shipping_address: shipping_address  # Now included
      }.to_json
    )
  end
end

# Phase 3: Service B makes field required (after Service A fully deployed)
class OrdersController < ApplicationController
  def create
    # Now require shipping_address
    unless params[:shipping_address].present?
      return render json: { error: 'shipping_address required' }, 
                    status: :bad_request
    end
    
    order = Order.create!(
      user_id: params[:user_id],
      items: params[:items],
      shipping_address: params[:shipping_address]
    )
    render json: order
  end
end

Background Worker Deployment manages worker processes handling asynchronous jobs. Workers must handle jobs created by both old and new application versions during the rolling update.

# Worker deployment with job version compatibility
class ImageProcessingWorker
  include Sidekiq::Job
  
  # Job enqueued format changed between versions
  # Old: perform(image_id)
  # New: perform(image_id, options_hash)
  
  def perform(*args)
    case args.length
    when 1
      # Old format
      process_image_legacy(args[0])
    when 2
      # New format
      process_image(args[0], args[1])
    else
      raise ArgumentError, "Invalid arguments: #{args.inspect}"
    end
  end
  
  def process_image_legacy(image_id)
    # Convert to new format with defaults
    process_image(image_id, { quality: 80, format: 'jpeg' })
  end
  
  def process_image(image_id, options)
    image = Image.find(image_id)
    processor = ImageProcessor.new(image, options)
    processor.process
    processor.save
  end
end

# Deployment strategy:
# 1. Deploy backward-compatible worker
# 2. Rolling update workers - old and new workers coexist
# 3. Deploy new application code that enqueues new format
# 4. Wait for queue to drain old-format jobs
# 5. Remove legacy code path in next deployment

Common Pitfalls

Rolling updates introduce failure modes that don't occur in simpler deployment strategies. Recognizing these pitfalls helps avoid production incidents and design more robust deployments.

Session Loss During Updates occurs when session data resides in application memory. A user's session exists on a specific instance. During a rolling update, that instance restarts, destroying the session. The user experiences unexpected logout or lost cart contents. Solutions include sticky sessions to keep users on the same instance, or externalizing session storage to Redis or a database.

# Problem: In-memory sessions lost during rolling update
# config/initializers/session_store.rb (bad)
# Backed by an in-process memory cache (config.cache_store = :memory_store),
# so each session lives inside a single instance and dies with it
Rails.application.config.session_store :cache_store

# Solution: Externalize session storage
# Gemfile
gem 'redis-rails'

# config/initializers/session_store.rb (good)
Rails.application.config.session_store :redis_store,
  servers: [ENV['REDIS_URL']],
  expire_after: 90.minutes,
  key: '_myapp_session',
  threadsafe: true,
  signed: true

Database Migration Conflicts happen when migrations modify schema in ways incompatible with old code. A common mistake adds a non-null column without a default value. Old code attempting to insert records fails because it doesn't know about the new column.

# Problem: Non-null column without default breaks old code
class AddUserPhoneNumber < ActiveRecord::Migration[7.0]
  def change
    add_column :users, :phone_number, :string, null: false
  end
end
# Old code: User.create!(name: 'Alice', email: 'alice@example.com')
# Fails: PG::NotNullViolation: null value in column "phone_number"
# (The migration itself also fails if the users table already has rows)

# Solution 1: Add default value
class AddUserPhoneNumber < ActiveRecord::Migration[7.0]
  def change
    add_column :users, :phone_number, :string, null: false, default: ''
  end
end

# Solution 2: Make nullable initially
class AddUserPhoneNumber < ActiveRecord::Migration[7.0]
  def change
    # Phase 1: Add nullable column
    add_column :users, :phone_number, :string
  end
end

# Deploy new code that sets phone_number
# Phase 2: Backfill data
# Phase 3: Add not-null constraint
class MakePhoneNumberRequired < ActiveRecord::Migration[7.0]
  def change
    change_column_null :users, :phone_number, false
  end
end

Health Check Failures cause deployments to stall when new instances never become healthy. Health checks fail for various reasons: application crashes on startup, dependencies unavailable, configuration errors, or health check endpoint bugs. The deployment waits indefinitely for health checks to pass.

# Problem: Health check too strict
class HealthController < ApplicationController
  def show
    # Fails if ANY dependency unavailable
    Redis.new.ping
    Elasticsearch::Client.new.ping
    ExternalApiClient.new.health_check
    
    render json: { status: 'ok' }
  rescue => e
    render json: { status: 'error' }, status: :service_unavailable
  end
end

# Solution: Separate readiness from liveness
class HealthController < ApplicationController
  # Readiness: Can this instance serve traffic?
  def ready
    # Check only critical dependencies
    ActiveRecord::Base.connection.execute('SELECT 1')
    render json: { status: 'ready' }
  rescue => e
    render json: { status: 'not_ready', error: e.message }, 
           status: :service_unavailable
  end
  
  # Liveness: Is this instance alive?
  def alive
    # Lightweight check - just confirm process responding
    render json: { status: 'alive' }
  end
end
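Wiring these endpoints into the application looks like this (the paths are assumptions and should match whatever the orchestrator's probes are configured to hit):

```ruby
# config/routes.rb
Rails.application.routes.draw do
  get '/health/ready', to: 'health#ready'
  get '/health/alive', to: 'health#alive'
end
```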

Breaking API Changes disrupt clients during mixed-version deployments. Removing an API field breaks old instances of other services or frontends expecting that field. Changing field types causes deserialization errors. Modifying business logic creates inconsistent behavior.

# Problem: Removing API field breaks old clients
# Old version
def show
  render json: {
    id: user.id,
    username: user.username,
    full_name: user.full_name
  }
end

# New version (breaks old clients)
def show
  render json: {
    id: user.id,
    display_name: user.full_name  # Renamed field
  }
end

# Solution: Maintain both fields during transition
def show
  render json: {
    id: user.id,
    display_name: user.full_name,
    # Deprecated but maintained for backward compatibility
    full_name: user.full_name,
    username: user.username  # Keep until clients migrated
  }
end

Insufficient Capacity During Updates causes performance degradation when remaining instances can't handle the load. The deployment removes instances from service, increasing load on remaining instances. Response times increase, timeouts occur, or the system becomes overloaded.

# Problem: Too aggressive batch size
class RollingUpdateConfig
  def batch_size
    # Updates 50% of instances simultaneously
    (total_instances * 0.5).ceil
  end
end
# During peak traffic, 50% capacity insufficient

# Solution: Consider capacity requirements
class RollingUpdateConfig
  def batch_size
    # Never reduce capacity below 75% during peak hours
    if peak_traffic_hours?
      (total_instances * 0.25).ceil
    else
      (total_instances * 0.5).ceil
    end
  end
  
  def peak_traffic_hours?
    Time.current.hour.in?(9..17) && Time.current.wday.in?(1..5)
  end
end
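The trade-off between batch size and capacity floor can be sketched in plain Ruby. `rollout_plan` below is a hypothetical helper, not part of any deployment tool:

```ruby
# Sketch: worst-case available capacity for a given batch size.
# With N instances and batches of size B, at most B instances are
# out of service at once, so capacity never drops below N - B.
def rollout_plan(total_instances, batch_size)
  {
    batches: (total_instances / batch_size.to_f).ceil,
    min_available: total_instances - batch_size,
    min_capacity_pct: ((total_instances - batch_size) * 100.0 / total_instances).round
  }
end

rollout_plan(8, 2)
# => { batches: 4, min_available: 6, min_capacity_pct: 75 }
```

Halving the batch size doubles the number of batches, and therefore total deployment time, in exchange for a higher capacity floor.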

Race Conditions in Data Migrations occur when migrating data during a rolling update. Old and new code simultaneously modify the same data with different logic. The final state depends on execution order rather than business logic.

# Problem: Concurrent data migration
class BackfillUserScores < ActiveRecord::Migration[7.0]
  def up
    User.find_each do |user|
      user.update!(score: calculate_score(user))
    end
  end
end
# Old code also updating scores during deployment
# Final scores inconsistent

# Solution: Idempotent background job with locking
class BackfillUserScoresJob
  include Sidekiq::Job
  
  def perform(user_id)
    User.transaction do
      user = User.lock.find(user_id)
      
      # Only update if not already migrated
      return if user.score_migrated?
      
      user.update!(
        score: ScoreCalculator.new(user).calculate,
        score_migrated: true
      )
    end
  end
end

# Run migration gradually
class BackfillUserScores < ActiveRecord::Migration[7.0]
  def up
    # Just enqueue jobs, actual update happens asynchronously
    User.find_each do |user|
      BackfillUserScoresJob.perform_async(user.id)
    end
  end
end

Connection Draining Issues happen when instances terminate before finishing active requests. A load balancer removes an instance from the pool, but the instance has ongoing requests. Immediately terminating the instance interrupts those requests, causing errors for users.

# Solution: Graceful shutdown with connection draining
# config/puma.rb
workers 4
threads 5, 5

# Wait for connections to finish before shutdown
worker_shutdown_timeout 30

on_worker_shutdown do
  puts "Worker shutting down, draining connections..."
end

# Systemd unit configuration
# /etc/systemd/system/myapp.service
[Service]
Type=notify
# Send SIGTERM to start graceful shutdown
KillMode=mixed
KillSignal=SIGTERM
# Wait up to 60 seconds for graceful shutdown
TimeoutStopSec=60
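The same draining behavior can be sketched at the application level in plain Ruby. `GracefulWorker` is a hypothetical illustration of the pattern, not Puma's internal mechanism:

```ruby
# Application-level sketch of connection draining: after SIGTERM the
# worker stops taking new jobs but finishes the one already in progress.
class GracefulWorker
  def initialize(jobs)
    @jobs = jobs
    @shutting_down = false
    Signal.trap('TERM') { @shutting_down = true }
  end

  def shutting_down?
    @shutting_down
  end

  def run
    completed = []
    # Check the flag only between jobs, never mid-job
    until @shutting_down || @jobs.empty?
      completed << @jobs.shift
    end
    completed
  end
end
```

The important property is that the shutdown flag is consulted only at job boundaries, so in-flight work always completes before the process exits.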

Reference

Rolling Update Strategy Comparison

| Strategy | Resource Usage | Deployment Speed | Rollback Speed | Complexity |
|---|---|---|---|---|
| Basic Rolling | Current capacity | Medium | Medium | Low |
| Surge Rolling | Up to 2x capacity | Fast | Fast | Medium |
| Blue-Green | 2x capacity | Instant switch | Instant | Medium |
| Canary Rolling | Current + canary | Slow | Medium | High |
| Recreate | Current capacity | Fast | Medium | Very Low |

Configuration Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| maxUnavailable | int or % | Maximum instances unavailable during update | 25% |
| maxSurge | int or % | Maximum instances above desired count | 25% |
| minReadySeconds | int | Minimum seconds before instance considered ready | 0 |
| progressDeadlineSeconds | int | Maximum time for deployment to progress | 600 |
| revisionHistoryLimit | int | Number of old ReplicaSets to retain | 10 |
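In Kubernetes these parameters live directly in the Deployment spec. A minimal sketch (names like web and myapp:v2 are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  minReadySeconds: 10
  progressDeadlineSeconds: 600
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:v2
```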

Health Check Types

| Check Type | Purpose | Timing | Failure Action |
|---|---|---|---|
| Readiness | Determine if ready for traffic | Continuous | Remove from load balancer |
| Liveness | Detect crashed instances | Continuous | Restart instance |
| Startup | Give slow apps time to start | Initial only | Prevent premature liveness checks |

Migration Safety Checklist

| Operation | Safe During Rolling Update | Mitigation if Unsafe |
|---|---|---|
| Add column with default | Yes | None needed |
| Add column without default | Yes | Provide default value |
| Add column NOT NULL | No | Add as nullable, backfill, add constraint later |
| Remove column | No | Deprecate in code first, remove column later |
| Rename column | No | Add new column, dual-write, remove old column |
| Change column type | No | Add new column, migrate data, remove old |
| Add index | Yes | Use concurrent index creation |
| Remove index | Yes | Remove from code first |
| Add foreign key | No | Add as not validated, validate separately |
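The index and foreign key mitigations look like this in Rails migrations (a sketch; the table and column names are made up):

```ruby
# Concurrent index creation builds the index without locking writes.
# On PostgreSQL it must run outside a transaction.
class AddIndexOnOrdersUserId < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end

# Foreign key in two steps: add it without validating existing rows,
# then validate separately so the long scan doesn't hold a heavy lock.
class AddOrdersUserFk < ActiveRecord::Migration[7.0]
  def change
    add_foreign_key :orders, :users, validate: false
  end
end

class ValidateOrdersUserFk < ActiveRecord::Migration[7.0]
  def change
    validate_foreign_key :orders, :users
  end
end
```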

Deployment Phase Template

| Phase | Actions | Verification |
|---|---|---|
| Pre-deployment | Run tests, build artifacts, database migrations | Tests pass, migrations backward compatible |
| Canary | Deploy to small subset | Monitor metrics, error rates normal |
| Rolling Update | Update remaining instances in batches | Health checks pass, capacity sufficient |
| Post-deployment | Smoke tests, monitor metrics | Error rates normal, performance acceptable |
| Cleanup | Remove old artifacts, deprecated code paths | Storage reclaimed |

Common Deployment Commands

| Platform | Deploy Command | Rollback Command |
|---|---|---|
| Kubernetes | `kubectl apply -f deployment.yaml` | `kubectl rollout undo deployment/app` |
| Capistrano | `cap production deploy` | `cap production deploy:rollback` |
| AWS ECS | `aws ecs update-service` | `aws ecs update-service` with previous task definition |
| Heroku | `git push heroku main` | `heroku rollback` |
| Docker Swarm | `docker service update` | `docker service rollback` |

Monitoring Metrics

| Metric | Target | Alert Threshold | Action |
|---|---|---|---|
| Error rate | Below 0.1% | Above 1% | Pause deployment |
| Response time p95 | Below 200ms | Above 500ms | Investigate performance |
| Health check success | 100% | Below 95% | Check instance logs |
| Active connections | Stable | 50% drop | Check load balancer |
| CPU usage | Below 70% | Above 90% | Reduce batch size |
| Memory usage | Below 80% | Above 95% | Check for memory leaks |