CrackedRuby

Overview

Data archival strategies address the challenge of managing data growth in production systems. As applications generate increasing volumes of data, maintaining all historical records in active databases degrades performance, increases storage costs, and complicates backup procedures. Archival strategies move older, infrequently accessed data to separate storage systems while preserving data integrity and accessibility for compliance, analytics, or historical reference.

The core concept distinguishes between active data requiring fast access and archived data accessed infrequently. Active data resides in production databases optimized for transactional workloads, while archived data moves to storage systems prioritizing cost efficiency over access speed. This separation improves database performance, reduces infrastructure costs, and maintains regulatory compliance without data loss.

Data archival differs from backup and deletion. Backups provide disaster recovery snapshots of entire systems, while archival selectively moves specific data based on age or usage patterns. Deletion permanently removes data, whereas archival preserves data in accessible but separate storage. Organizations often implement archival for audit logs, completed transactions, historical customer records, and time-series data that must be retained for compliance but rarely accessed.

Consider an e-commerce platform storing order records. Orders from the past five years remain in the production database for customer service queries and analytics. Orders older than five years move to archival storage, remaining accessible for tax audits or legal inquiries but no longer impacting database performance. The production database maintains query speed, while the business satisfies retention requirements and reduces storage costs.

Archival strategies intersect with data lifecycle management, compliance requirements, and system performance optimization. Effective archival requires careful planning around retention policies, access patterns, restoration procedures, and data format preservation across years or decades.

Key Principles

Data archival operates on several fundamental principles that guide strategy selection and implementation. Understanding these principles helps organizations design archival systems that balance cost, accessibility, and compliance requirements.

Retention policy definition establishes which data is archived and when. Policies specify retention periods based on regulatory requirements, business needs, and data classification. Different data types may have different retention requirements: financial records might require seven-year retention while application logs need only 90 days. Policies must account for legal holds that prevent archival or deletion during litigation or investigations.
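A retention policy often starts as a simple lookup table in code. The sketch below is a minimal illustration; the data types, periods, and legal-hold flag are hypothetical, not drawn from any specific regulation:

```ruby
require 'date'

# Hypothetical retention periods in days, per data classification.
RETENTION_DAYS = {
  financial_record: 7 * 365,  # e.g. seven-year retention
  application_log:  90,
  audit_log:        365
}.freeze

# Records under a legal hold must never be archived or deleted.
def archivable?(data_type, created_on, legal_hold: false, today: Date.today)
  return false if legal_hold
  (today - created_on).to_i > RETENTION_DAYS.fetch(data_type)
end

archivable?(:application_log, Date.today - 120)  # => true (past 90-day retention)
archivable?(:financial_record, Date.today - 120) # => false
```

Keeping the policy in one data structure makes retention periods auditable and easy to change without touching archival logic.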

Data classification categorizes data based on access frequency, business value, and regulatory requirements. Hot data requires immediate access and remains in production systems. Warm data sees occasional access and may reside in near-line storage. Cold data rarely sees access and moves to low-cost archival storage. Deep archive stores data that may never be accessed but must be retained for compliance.

Access pattern consideration determines archival strategy viability. Data accessed frequently should not be archived regardless of age. Archival works best for data with predictable decline in access frequency over time. Applications must track access patterns to identify archival candidates and validate archival decisions don't harm user experience.

Data integrity preservation ensures archived data remains accurate and complete throughout its retention period. Archival processes must validate data integrity during transfer, prevent corruption in long-term storage, and verify data remains readable years after archival. This requires format stability, checksum verification, and periodic data validation.

Compliance alignment matches archival strategies to regulatory requirements. Many industries mandate specific retention periods, data formats, and access controls for archived data. Healthcare organizations must comply with HIPAA requirements for patient records. Financial institutions follow regulations specifying retention periods for transaction records. Archival strategies must support e-discovery requests, audit trails, and chain of custody documentation.

Performance impact minimization prevents archival operations from degrading production system performance. Bulk data transfers can saturate network bandwidth and overload storage systems. Archival processes typically run during low-traffic periods, use rate limiting, and process data in small batches. The archival process itself should not become a performance bottleneck.

Cost optimization balances storage costs against access requirements. Archival storage costs significantly less than production storage, but retrieval may incur fees and delays. Organizations must analyze the total cost including storage, retrieval, and operational overhead. Cloud providers offer multiple storage tiers with different cost-performance characteristics that inform strategy selection.

Restoration capability ensures archived data can be retrieved when needed. Archival systems must support both individual record retrieval and bulk restoration. Restoration time objectives define acceptable delays for data access. Critical data may require faster restoration than rarely accessed compliance records. The restoration process should validate data integrity and handle format conversions if necessary.

Design Considerations

Selecting an appropriate archival strategy requires analyzing multiple factors affecting implementation feasibility, operational overhead, and long-term viability. Different strategies suit different use cases, and organizations often implement multiple approaches for different data types.

Data access frequency patterns determine whether archival makes sense. If data older than a threshold sees less than 5% of total queries, archival likely improves performance and reduces costs. If old data receives frequent access, archival may hurt user experience by introducing retrieval delays. Organizations should analyze query logs to understand access patterns before implementing archival.

Compliance and regulatory requirements constrain archival options. Some regulations mandate specific storage formats, encryption requirements, or geographic restrictions. Financial services regulations may require archived data remain searchable within specific timeframes. Healthcare regulations may prohibit storing patient data in certain jurisdictions. The archival strategy must satisfy all applicable requirements while minimizing operational complexity.

Recovery time objectives specify how quickly archived data must be accessible. Some use cases tolerate hours or days for data retrieval, making cold storage viable. Other scenarios require near-instant access to archived data, necessitating more expensive storage solutions. Customer-facing features typically demand faster restoration than internal audit functions.

Storage cost versus access cost trade-offs affect strategy selection. Cloud providers charge differently for storage and retrieval. Amazon Glacier offers very low storage costs but charges for data retrieval and requires hours for access. S3 Standard costs more for storage but provides immediate access. Analyzing access frequency helps determine the optimal storage tier. Infrequently accessed data benefits from cheaper storage despite retrieval costs.
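The trade-off can be made concrete with a small cost model. This sketch uses illustrative per-GB prices (assumed values, not current quotes) to compare tiers for a given access pattern:

```ruby
# Illustrative prices in USD per GB per month; real prices vary by
# region and provider.
STORAGE_PER_GB   = { standard: 0.023, glacier_ir: 0.004 }.freeze
RETRIEVAL_PER_GB = { standard: 0.0,   glacier_ir: 0.03 }.freeze

def monthly_cost(tier, stored_gb, retrieved_gb)
  stored_gb * STORAGE_PER_GB.fetch(tier) +
    retrieved_gb * RETRIEVAL_PER_GB.fetch(tier)
end

# 500 GB archived, 5 GB retrieved per month: the cold tier wins
# despite retrieval fees because access is rare.
monthly_cost(:standard, 500, 5)   # => 11.5
monthly_cost(:glacier_ir, 500, 5) # => ~2.15
```

When retrieval volume grows, rerun the comparison; heavy access can erase the savings of cold storage.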

Data volume and growth rate impact archival frequency and batch sizes. High-volume systems may need daily archival to prevent database growth from affecting performance. Lower-volume systems might archive monthly or quarterly. Extremely large datasets require careful attention to transfer times and bandwidth constraints. The archival process must keep pace with data generation to prevent unbounded growth.

Query complexity for archived data affects strategy viability. Simple key-based lookups work well with most archival strategies. Complex analytical queries spanning archived and active data require either expensive cross-storage queries or strategies that preserve query capability after archival. Time-series databases often partition data by time period, allowing queries to target specific partitions efficiently.

Format longevity and migration concerns grow important for multi-year retention. File formats and database schemas evolve, potentially making old archived data unreadable. The archival strategy should either use stable formats with long-term guarantees or include periodic migration to current formats. JSON and CSV formats offer better long-term stability than proprietary binary formats.

Integration with existing infrastructure determines implementation complexity. Archival strategies requiring significant architectural changes may not be feasible for legacy systems. The ideal strategy works with existing tools, workflows, and permissions systems. Cloud-native applications can more easily adopt cloud storage for archival than applications tightly coupled to on-premises infrastructure.

Implementation Approaches

Data archival strategies vary in complexity, cost, and suitability for different scenarios. Organizations often combine multiple approaches for different data types or access patterns.

Time-based partitioning divides data into separate partitions based on creation date or timestamp. Database tables partition by month or year, with older partitions archived separately. This approach simplifies archival by allowing entire partitions to move to archival storage. Queries can target specific partitions based on date ranges, maintaining performance. PostgreSQL and MySQL support table partitioning natively, making implementation straightforward. Time-based partitioning works well for time-series data, logs, and transactions with temporal characteristics.

Separate archival database maintains a dedicated database instance for archived data. Active data remains in the production database while archival processes periodically move old records to the archival database. This approach preserves data structure and query capability while isolating performance impact. Applications can query the archival database when needed using the same database driver and query language. The archival database can run on cheaper hardware or in a different geographic region. This strategy suits applications where occasional queries need to span historical data but most queries target recent data.
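In Rails 6+, the separate archival database pattern maps directly onto ActiveRecord's multiple-databases support. A minimal configuration sketch; the :archive connection name, database names, and models here are assumptions about this application's setup:

```ruby
# config/database.yml (excerpt) -- a second connection for archived data:
#   production:
#     primary:
#       database: app_production
#     archive:
#       database: app_archive
#       migrations_paths: db/archive_migrate

# Abstract base class routing reads and writes to the archival database.
class ArchiveRecord < ApplicationRecord
  self.abstract_class = true
  connects_to database: { writing: :archive, reading: :archive }
end

# Archived models inherit the archival connection but keep the same API.
class ArchivedOrder < ArchiveRecord
  self.table_name = 'orders'
end
```

Queries such as ArchivedOrder.where(created_at: range) then run against the archival instance through the same ActiveRecord interface used for active data.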

Object storage migration moves archived data from databases to object storage like Amazon S3 or Azure Blob Storage. Data exports to files in formats like JSON, CSV, or Parquet, then uploads to object storage. Object storage costs significantly less than database storage and scales to petabytes. However, data loses queryability without additional tooling. This approach works well for compliance archival where data rarely needs retrieval. When access is needed, applications download files from object storage and process them locally.

Data warehouse consolidation moves archived data to a separate data warehouse optimized for analytical queries. Transactional databases archive old data to analytical systems like Amazon Redshift, Google BigQuery, or Snowflake. The data warehouse provides efficient querying across large historical datasets while removing load from transactional systems. This strategy suits organizations already operating data warehouses and needing analytical access to historical data.

Hybrid tiered storage combines multiple storage systems in a hierarchy. Recent data resides in fast databases, warm data moves to slower database instances, and cold data migrates to object storage. Applications query the appropriate tier based on data age. This approach optimizes cost by matching storage costs to access patterns. Implementation complexity increases due to managing multiple storage systems and query routing logic.

Database table archival with compression keeps archived data in the same database but moves it to separate tables with compression enabled. MySQL offers InnoDB page compression and the append-only ARCHIVE engine, while PostgreSQL compresses large values automatically via TOAST and gains columnar compression through extensions; text-heavy data commonly shrinks by 70-90%. Archived tables may be placed in separate tablespaces on slower storage. This approach maintains query capability while reducing storage costs and production table size. The strategy works well when database migration is impractical but performance improvement is needed.

Microservice-based archival service implements archival as a dedicated service that handles archival for multiple applications. The service provides APIs for archiving, searching, and retrieving data. This centralizes archival logic, improves consistency across applications, and allows specialized optimization. The archival service can manage multiple storage backends, handle format conversions, and enforce retention policies uniformly. This approach suits microservice architectures where multiple services generate data requiring archival.

Ruby Implementation

Ruby applications implement data archival through various libraries and patterns. ActiveRecord provides database-level capabilities while background job frameworks handle the asynchronous nature of archival operations.

Basic ActiveRecord archival moves records from production tables to archival tables. This approach maintains data structure while separating active and archived data.

class Order < ApplicationRecord
  def self.archive_old_orders(cutoff_date)
    where("created_at < ?", cutoff_date).find_each(batch_size: 1000) do |order|
      # Copy and delete atomically so a failure cannot lose the record
      transaction do
        ArchivedOrder.create!(order.attributes.except('id'))
        order.destroy
      end
    end
  end
end

class ArchivedOrder < ApplicationRecord
  # Mirror of Order model for archival table
end

Partition-based archival uses PostgreSQL table partitioning through ActiveRecord migrations. This allows entire partitions to be detached for archival.

class CreatePartitionedOrders < ActiveRecord::Migration[7.0]
  def up
    execute <<-SQL
      CREATE TABLE orders (
        id bigserial,
        created_at timestamp NOT NULL,
        -- other columns
        PRIMARY KEY (id, created_at)
      ) PARTITION BY RANGE (created_at);
      
      CREATE TABLE orders_2024 PARTITION OF orders
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
        
      CREATE TABLE orders_2023 PARTITION OF orders
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
    SQL
  end
end

class ArchivalService
  def archive_partition(year)
    ActiveRecord::Base.connection.execute(
      "ALTER TABLE orders DETACH PARTITION orders_#{year}"
    )
    
    # Move detached partition to archival database
    dump_partition(year)
    upload_to_s3(year)
    drop_partition(year)
  end
end

S3 archival with AWS SDK exports data to JSON and uploads to S3 for cost-effective long-term storage.

require 'aws-sdk-s3'

class S3ArchivalService
  def initialize
    @s3_client = Aws::S3::Client.new(
      region: ENV['AWS_REGION'],
      access_key_id: ENV['AWS_ACCESS_KEY_ID'],
      secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
    )
    @bucket = ENV['ARCHIVAL_BUCKET']
  end
  
  def archive_orders(start_date, end_date)
    orders = Order.where(created_at: start_date..end_date)
    
    file_path = "/tmp/orders_#{start_date}_#{end_date}.json.gz"
    
    Zlib::GzipWriter.open(file_path) do |gz|
      orders.find_each do |order|
        gz.write(order.to_json)
        gz.write("\n")
      end
    end
    
    key = "orders/#{start_date.year}/#{File.basename(file_path)}"
    
    # Stream the file instead of loading it into memory with File.read
    File.open(file_path, 'rb') do |file|
      @s3_client.put_object(
        bucket: @bucket,
        key: key,
        body: file,
        storage_class: 'GLACIER_IR' # S3 Glacier Instant Retrieval
      )
    end
    
    File.delete(file_path)
    orders.delete_all # remove from production only after the upload succeeds
    
    key
  end
  
  def retrieve_archived_orders(key)
    response = @s3_client.get_object(bucket: @bucket, key: key)
    
    orders = []
    Zlib::GzipReader.new(StringIO.new(response.body.read)).each_line do |line|
      orders << JSON.parse(line)
    end
    
    orders
  end
end

Background job archival with Sidekiq processes archival asynchronously to avoid blocking application threads.

class ArchiveOldDataJob
  include Sidekiq::Job
  
  sidekiq_options queue: :low_priority, retry: 3
  
  def perform(model_name, cutoff_date)
    model = model_name.constantize
    # Sidekiq serializes job arguments as JSON, so dates arrive as strings
    cutoff = Time.zone.parse(cutoff_date.to_s)
    batch_size = 1000
    
    loop do
      batch = model.where("created_at < ?", cutoff)
                   .limit(batch_size)
                   .to_a
      
      break if batch.empty?
      
      batch.each do |record|
        archive_record(model_name, record)
      end
      
      sleep(1) # Rate limiting between batches
    end
  end
  
  private
  
  def archive_record(model_name, record)
    archival_model = "Archived#{model_name}".constantize
    # Copy and delete atomically so a failure cannot drop a record
    ActiveRecord::Base.transaction do
      archival_model.create!(record.attributes.except('id'))
      record.destroy
    end
  end
end

# Schedule monthly archival
class ArchivalScheduler
  def self.schedule_monthly_archival
    cutoff_date = 6.months.ago.iso8601 # pass a JSON-safe string to Sidekiq
    
    ['Order', 'Payment', 'ActivityLog'].each do |model_name|
      ArchiveOldDataJob.perform_async(model_name, cutoff_date)
    end
  end
end

Database dump archival exports filtered rows to compressed files for offline storage. pg_dump has no row-level filter, so the export streams matching rows through psql's COPY instead.

class DatabaseDumpArchival
  def archive_table_data(table_name, start_date, end_date)
    timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
    filename = "#{table_name}_#{start_date}_#{end_date}_#{timestamp}.csv.gz"
    filepath = "/tmp/#{filename}"
    
    # Export matching rows as CSV (pg_dump cannot filter by condition)
    copy_sql = "\\copy (SELECT * FROM #{table_name} " \
               "WHERE created_at >= '#{start_date}' AND created_at < '#{end_date}') " \
               "TO STDOUT WITH (FORMAT csv, HEADER)"
    
    command = <<~CMD
      psql \
        --host=#{db_config[:host]} \
        --username=#{db_config[:username]} \
        --dbname=#{db_config[:database]} \
        --command="#{copy_sql}" | gzip > #{filepath}
    CMD
    
    system(command, exception: true)
    
    # Upload to archival storage
    upload_to_archival_storage(filepath, filename)
    
    # Delete archived data from production only after a successful upload
    ActiveRecord::Base.connection.execute(
      "DELETE FROM #{table_name} WHERE created_at >= '#{start_date}' AND created_at < '#{end_date}'"
    )
    
    File.delete(filepath)
  end
  
  private
  
  def db_config
    ActiveRecord::Base.connection_db_config.configuration_hash
  end
  
  def upload_to_archival_storage(filepath, filename)
    # Upload to S3, NFS, or other archival storage
  end
end

Archival with checksum verification ensures data integrity during archival by verifying checksums before deletion.

require 'digest'

class VerifiedArchivalService
  def archive_with_verification(records, storage_adapter)
    manifest = {
      archived_at: Time.now,
      record_count: records.count,
      checksums: []
    }
    
    records.each do |record|
      json_data = record.to_json
      checksum = Digest::SHA256.hexdigest(json_data)
      
      storage_adapter.write(record.id, json_data)
      
      manifest[:checksums] << {
        id: record.id,
        checksum: checksum
      }
    end
    
    # Verify before deletion
    verification_passed = verify_archived_data(manifest, storage_adapter)
    
    if verification_passed
      records.each(&:destroy)
      storage_adapter.write_manifest(manifest)
      { success: true, archived_count: records.count }
    else
      { success: false, error: 'Verification failed' }
    end
  end
  
  private
  
  def verify_archived_data(manifest, storage_adapter)
    manifest[:checksums].all? do |entry|
      stored_data = storage_adapter.read(entry[:id])
      stored_checksum = Digest::SHA256.hexdigest(stored_data)
      stored_checksum == entry[:checksum]
    end
  end
end

Multi-tier archival service implements progressive archival through multiple storage tiers based on data age.

class MultiTierArchivalService
  TIERS = {
    warm: { age: 90.days, storage: :fast_db },
    cold: { age: 1.year, storage: :s3_standard },
    frozen: { age: 3.years, storage: :s3_glacier }
  }
  
  def process_tiering
    TIERS.each do |tier_name, config|
      cutoff_date = config[:age].ago
      next_tier = next_tier_for(tier_name)
      
      move_to_tier(tier_name, next_tier, cutoff_date) if next_tier
    end
  end
  
  private
  
  def move_to_tier(current_tier, next_tier, cutoff_date)
    records = fetch_records_for_tier(current_tier, cutoff_date)
    
    records.find_each(batch_size: 500) do |record|
      storage_adapter_for(next_tier).store(record)
      storage_adapter_for(current_tier).delete(record.id)
    end
  end
  
  def next_tier_for(tier_name)
    tier_order = TIERS.keys
    current_index = tier_order.index(tier_name)
    tier_order[current_index + 1]
  end
  
  def storage_adapter_for(tier_name)
    case TIERS[tier_name][:storage]
    when :fast_db
      DatabaseStorageAdapter.new
    when :s3_standard
      S3StorageAdapter.new(storage_class: 'STANDARD')
    when :s3_glacier
      S3StorageAdapter.new(storage_class: 'GLACIER')
    end
  end
end

Performance Considerations

Data archival decisions significantly affect system performance across multiple dimensions. Understanding performance implications helps design archival strategies that improve production system responsiveness while maintaining data accessibility.

Database query performance improves dramatically when table sizes decrease through archival. Database indexes perform better on smaller tables, query planners generate more efficient execution plans, and table scans complete faster. A query taking 50ms against a 100-million-row table may drop to 10ms after 80% of its records are archived. The benefit compounds as data continues growing: without archival, query times increase linearly or worse with table size.

Index maintenance overhead decreases with smaller tables. Every insert, update, or delete on indexed columns requires index updates. Large indexes consume more memory and disk I/O. Archiving old data reduces index size, making index updates faster and freeing cache memory for frequently accessed data. Applications seeing high write volumes benefit most from this reduction.

Backup and restore times decrease proportionally to database size reduction. A database backup taking six hours might complete in two hours after archiving. Restore operations similarly benefit. Faster backups and restores improve disaster recovery objectives and reduce maintenance windows. Archival also reduces backup storage costs since smaller databases generate smaller backup files.

Archival operation performance requires careful management to avoid impacting production workloads. Bulk deletion of archived records can lock tables, preventing concurrent access. Breaking archival into small batches with delays between batches spreads the load over time. Performing archival during low-traffic periods minimizes user impact. Network bandwidth becomes a constraint when transferring large data volumes to remote archival storage.

Data transfer bottlenecks emerge when moving large datasets. Transferring terabytes of data to cloud storage may saturate network connections or consume transfer quotas. Compression before transfer reduces bandwidth requirements but adds CPU overhead. Estimating transfer times helps schedule archival operations appropriately. A 1TB dataset over a 1Gbps connection needs roughly 2.2 hours at line rate, and longer once protocol overhead and real-world throughput are considered.
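Transfer time is worth estimating up front. A minimal calculator, with an assumed overhead factor standing in for protocol overhead and throughput loss:

```ruby
# Rough transfer-time estimate. overhead_factor (assumed here as 15%)
# accounts for protocol overhead and throughput below line rate.
def transfer_hours(size_tb, link_gbps, overhead_factor: 1.15)
  seconds = (size_tb * 1e12 * 8) / (link_gbps * 1e9) * overhead_factor
  (seconds / 3600).round(2)
end

transfer_hours(1, 1)   # => 2.56 hours for 1 TB over 1 Gbps
transfer_hours(10, 1)  # => 25.56 hours -- worth splitting across windows
```

Running the estimate before scheduling an archival window prevents a transfer from spilling into peak traffic hours.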

Storage I/O patterns affect archival performance differently across storage types. Object storage handles large sequential writes efficiently but struggles with small random writes. Database storage performs better with small transactions but may slow with large bulk operations. Archival strategies should match their I/O patterns to storage characteristics. Writing to object storage in large batches performs better than individual record writes.

Retrieval performance trade-offs vary by storage tier. Amazon S3 Standard provides millisecond retrieval while Glacier Deep Archive requires 12-48 hours. The cost savings of cheaper storage must justify retrieval delays for the use case. Applications needing occasional access to archived data should avoid ultra-cold storage tiers. Understanding retrieval SLAs prevents choosing archival strategies incompatible with business requirements.

Query performance across archived and active data presents challenges. Queries spanning active and archived data may need to query multiple systems and merge results. This increases latency and complexity compared to single-system queries. Materializing archived data back to production databases for analysis defeats archival purposes. Data warehouse solutions mitigate this by providing unified query interfaces across storage tiers.

Compression impact reduces storage costs and transfer times but adds CPU overhead. Compressing archived data before storage decreases costs by 60-90% for text data. Decompression during retrieval adds latency. Applications should balance compression ratio against CPU cost. Streaming compression during archival transfers prevents storing uncompressed temporary files.

Connection pool saturation can occur during archival operations if too many concurrent database connections perform archival queries. Archival processes should use separate connection pools from application connections or limit concurrent archival workers. Saturating connection pools degrades application performance during archival windows.

Memory consumption during archival requires monitoring. Loading large result sets into memory for processing can exhaust available RAM. Streaming approaches that process records individually or in small batches prevent memory exhaustion. Ruby's find_each method in ActiveRecord provides efficient batch processing that maintains constant memory usage regardless of dataset size.

Tools & Ecosystem

Multiple tools and services facilitate data archival implementation. Ruby applications can integrate these tools through libraries and APIs to build comprehensive archival solutions.

AWS S3 and Glacier provide object storage with multiple tiers for different access patterns. S3 Standard offers immediate access, S3 Infrequent Access reduces costs for less frequent access, Glacier Flexible Retrieval provides low-cost storage with hours-long retrieval, and Glacier Deep Archive offers the lowest cost with 12-48 hour retrieval. The aws-sdk-s3 gem provides Ruby integration.

require 'aws-sdk-s3'

client = Aws::S3::Client.new(region: 'us-east-1')

# Upload with lifecycle transition
client.put_object(
  bucket: 'archival-bucket',
  key: 'orders/2020/data.json.gz',
  body: compressed_data,
  storage_class: 'STANDARD_IA'
)

# Configure bucket lifecycle
client.put_bucket_lifecycle_configuration(
  bucket: 'archival-bucket',
  lifecycle_configuration: {
    rules: [{
      id: 'archive-old-data',
      status: 'Enabled',
      filter: { prefix: '' }, # each rule needs a filter; '' matches all objects
      transitions: [
        { days: 90, storage_class: 'GLACIER' },
        { days: 365, storage_class: 'DEEP_ARCHIVE' }
      ]
    }]
  }
)

PostgreSQL table partitioning manages data archival through declarative partitioning. Tables partition by range, list, or hash. Partitions can be detached for archival or dropped entirely. The pg gem provides Ruby interface to PostgreSQL-specific features.

MySQL archival storage engine provides transparent compression for archived tables. The ARCHIVE storage engine compresses rows as they're inserted and supports only INSERT and SELECT (no updates, with an index allowed only on an AUTO_INCREMENT column), making it suitable for append-only archival data.

Sidekiq and delayed_job provide background job processing for asynchronous archival operations. These libraries handle job scheduling, retry logic, and monitoring, preventing archival from blocking application threads.

Logrotate manages log file archival on Linux systems. Configuration files specify rotation frequency, compression, and retention periods. Applications writing to log files benefit from logrotate's automatic archival.

# /etc/logrotate.d/rails_app
/var/log/rails_app/*.log {
  daily
  missingok
  rotate 30
  compress
  delaycompress
  notifempty
  create 0640 rails rails
  sharedscripts
  postrotate
    systemctl reload rails_app
  endscript
}

Elasticsearch curator manages time-series index archival in Elasticsearch clusters. Curator can close, delete, or snapshot old indices based on age or size criteria. This prevents Elasticsearch clusters from growing unbounded with historical data.

Apache Parquet provides a columnar storage format efficient for analytical workloads. Archiving data to Parquet format enables efficient querying in data warehouses while achieving high compression ratios. Ruby support is available through the red-parquet gem from the Apache Arrow project.

SQLite for archival databases works well for smaller archival datasets requiring occasional queries. Each archived period can store in separate SQLite files, providing simple file-based archival with full SQL query capability.

require 'sqlite3'

def create_archival_database(year)
  db = SQLite3::Database.new("orders_archive_#{year}.db")
  
  db.execute <<-SQL
    CREATE TABLE orders (
      id INTEGER PRIMARY KEY,
      created_at TEXT,
      customer_id INTEGER,
      total_amount DECIMAL,
      data TEXT
    )
  SQL
  
  db.execute "CREATE INDEX idx_created_at ON orders(created_at)"
  
  db
end

def archive_to_sqlite(year)
  archive_db = create_archival_database(year)
  year_range = Date.new(year, 1, 1)...Date.new(year + 1, 1, 1)
  
  Order.where(created_at: year_range)
       .find_each(batch_size: 1000) do |order|
    # SQLite binds accept strings and numbers, so cast Time and BigDecimal
    archive_db.execute(
      "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
      [order.id, order.created_at.to_s, order.customer_id,
       order.total_amount.to_f, order.to_json]
    )
  end
  
  archive_db.close
end

Database backup tools like pg_dump and mysqldump export data for archival. These tools support filtering by table or conditions, compression, and parallel export. The backed-up files serve as archival storage.

Cloud provider lifecycle policies automate storage tier transitions. AWS S3, Google Cloud Storage, and Azure Blob Storage support lifecycle rules that automatically move objects between storage tiers based on age or access patterns. This reduces operational overhead for multi-tier archival.

Data warehouse solutions like Amazon Redshift, Snowflake, and Google BigQuery provide archival storage with analytical query capability. These systems optimize for infrequent writes and complex analytical queries across large datasets.

Compression libraries reduce archival storage requirements. The zlib, bzip2, and lz4 gems provide different compression algorithms with varying trade-offs between compression ratio, speed, and CPU usage. Gzip typically reduces text data by 70-90%.
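Ruby's stdlib covers the common case; a quick Zlib round-trip on repetitive CSV-like text shows the kind of reduction involved:

```ruby
require 'zlib'

text = "order_id,status,total\n" +
       (1..10_000).map { |i| "#{i},shipped,19.99" }.join("\n")

compressed = Zlib::Deflate.deflate(text)
restored   = Zlib::Inflate.inflate(compressed)

ratio = (1 - compressed.bytesize.to_f / text.bytesize) * 100
raise 'round-trip mismatch' unless restored == text
# Repetitive text like this typically shrinks by well over half.
```

For file-sized data, Zlib::GzipWriter and Zlib::GzipReader (shown earlier) stream instead of holding the full dataset in memory.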

Monitoring tools track archival operation health. Prometheus and Datadog can monitor archival job success rates, processing times, data volumes, and errors. Alerting on archival failures prevents compliance issues from unnoticed archival problems.
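Whatever the backend, the archival job itself only needs to emit a few numbers. A backend-agnostic sketch in which the emit callback stands in for a Prometheus or Datadog client (the metric names are assumptions):

```ruby
# Times an archival run and reports outcome metrics through any callback.
def instrumented_archival(emit:)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  archived = yield  # the block performs archival and returns a record count
  emit.call({ metric: 'archival.success', archived: archived,
              seconds: Process.clock_gettime(Process::CLOCK_MONOTONIC) - started })
  archived
rescue StandardError => e
  emit.call({ metric: 'archival.failure', error: e.class.name })
  raise
end

# Usage with a stand-in reporter that just collects events:
events = []
instrumented_archival(emit: ->(event) { events << event }) { 42 }
events.first[:metric] # => "archival.success"
```

Swapping the lambda for a real metrics client turns this into the alerting feed described above without changing the archival code.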

Reference

Archival Strategy Comparison

Strategy | Access Speed | Cost | Query Capability | Complexity
Archival table | Fast | Medium | Full SQL | Low
Separate database | Fast | Medium | Full SQL | Medium
Object storage | Slow | Low | None | Medium
Data warehouse | Medium | Medium | Analytical | High
Compressed dump | Very slow | Very low | None | Low
Hybrid tiered | Variable | Low | Limited | High

Storage Tier Characteristics

Tier | Retrieval Time | Cost per GB/month | Use Case
S3 Standard | Immediate | $0.023 | Warm data
S3 Standard-IA | Immediate | $0.0125 | Infrequent access
S3 Glacier IR | Immediate | $0.004 | Archival with instant access
S3 Glacier Flexible | 1-5 hours | $0.0036 | Rarely accessed archival
S3 Deep Archive | 12-48 hours | $0.00099 | Long-term compliance

Archival Decision Matrix

Data Age | Access Frequency | Recommended Strategy
< 90 days | High | Production database
90-365 days | Medium | Separate database or warm storage
1-3 years | Low | Object storage or cold tier
> 3 years | Very low | Deep archive or tape

Common Retention Policies

Industry | Data Type | Typical Retention
Financial | Transaction records | 7 years
Healthcare | Patient records | 6 years after last visit
E-commerce | Order history | 3-7 years
SaaS | User activity logs | 90 days to 1 year
Government | Public records | Permanent

ActiveRecord Archival Methods

Method | Purpose | Performance Impact
find_each | Batch iteration | Low memory usage
delete_all | Bulk deletion | Fast but no callbacks
destroy_all | Individual deletion | Slow with callbacks
insert_all | Bulk insertion | Fast without callbacks
transaction | Atomic operations | Ensures consistency

PostgreSQL Partition Commands

Command | Purpose
CREATE TABLE ... PARTITION BY RANGE | Create partitioned table
CREATE TABLE ... PARTITION OF | Create partition
ALTER TABLE ... DETACH PARTITION | Remove partition from parent
ALTER TABLE ... ATTACH PARTITION | Add existing table as partition
DROP TABLE | Delete detached partition

Compression Algorithm Comparison

Algorithm | Compression Ratio | Speed | CPU Usage | Gem
Gzip | 70-80% | Medium | Medium | zlib
Bzip2 | 80-90% | Slow | High | bzip2-ffi
LZ4 | 50-60% | Very fast | Low | lz4-ruby
Zstd | 70-80% | Fast | Low | zstd-ruby

Archival Process Checklist

Step | Verification
Define retention policy | Document business and compliance requirements
Identify archival candidates | Analyze access patterns and data age
Choose storage tier | Match cost and access needs
Implement archival logic | Write and test archival code
Verify data integrity | Checksum validation before deletion
Test restoration | Ensure archived data is retrievable
Monitor and alert | Track success rates and failures
Document procedures | Maintain runbooks for operations

S3 Lifecycle Configuration

Parameter | Description | Example Value
ID | Rule identifier | archive-policy-001
Status | Rule state | Enabled
Prefix | Object path filter | orders/2020/
Transitions | Storage class changes | 90 days to Glacier
Expiration | Deletion timing | 2555 days
NoncurrentVersionExpiration | Delete old versions | 30 days

Ruby Archival Gems

Gem | Purpose | Features
aws-sdk-s3 | S3 integration | Multi-tier storage, lifecycle policies
pg | PostgreSQL driver | Partition support, native types
sqlite3 | SQLite integration | File-based archival databases
sidekiq | Background jobs | Async archival processing
zlib | Compression | Gzip compression/decompression
parallel | Parallel processing | Multi-threaded archival

Monitoring Metrics

Metric | Purpose | Alert Threshold
Archival success rate | Track failures | < 95%
Processing time | Detect slowdowns | > 2x baseline
Data volume archived | Capacity planning | Monitor trends
Storage costs | Budget management | > threshold
Restoration time | SLA compliance | > defined SLA
Error count | Operational health | > 5 per hour