Overview
Data archival strategies address the challenge of managing data growth in production systems. As applications generate increasing volumes of data, maintaining all historical records in active databases degrades performance, increases storage costs, and complicates backup procedures. Archival strategies move older, infrequently accessed data to separate storage systems while preserving data integrity and accessibility for compliance, analytics, or historical reference.
The core concept distinguishes between active data requiring fast access and archived data accessed infrequently. Active data resides in production databases optimized for transactional workloads, while archived data moves to storage systems prioritizing cost efficiency over access speed. This separation improves database performance, reduces infrastructure costs, and maintains regulatory compliance without data loss.
Data archival differs from backup and deletion. Backups provide disaster recovery snapshots of entire systems, while archival selectively moves specific data based on age or usage patterns. Deletion permanently removes data, whereas archival preserves data in accessible but separate storage. Organizations often implement archival for audit logs, completed transactions, historical customer records, and time-series data that must be retained for compliance but rarely accessed.
Consider an e-commerce platform storing order records. Orders from the past five years remain in the production database for customer service queries and analytics. Orders older than five years move to archival storage, remaining accessible for tax audits or legal inquiries but no longer impacting database performance. The production database maintains query speed, while the business satisfies retention requirements and reduces storage costs.
Archival strategies intersect with data lifecycle management, compliance requirements, and system performance optimization. Effective archival requires careful planning around retention policies, access patterns, restoration procedures, and data format preservation across years or decades.
Key Principles
Data archival operates on several fundamental principles that guide strategy selection and implementation. Understanding these principles helps organizations design archival systems that balance cost, accessibility, and compliance requirements.
Retention policy definition establishes which data is archived, and when. Policies specify retention periods based on regulatory requirements, business needs, and data classification. Different data types may have different retention requirements: financial records might require seven-year retention while application logs need only 90 days. Policies must account for legal holds that prevent archival or deletion during litigation or investigations.
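As a minimal sketch of this principle (the policy table, field names, and legal-hold list below are illustrative assumptions, not a library API), retention rules can be expressed as data and consulted before any record becomes an archival candidate:

```ruby
require 'date'

# Illustrative policy table: when each data type may be archived and how long
# it must be retained. Values here are examples, not regulatory advice.
RETENTION_POLICIES = {
  'financial_record' => { archive_after_days: 365, retain_days: 7 * 365 },
  'application_log'  => { archive_after_days: 30,  retain_days: 90 }
}

# Records frozen by litigation; these are never archived or deleted.
LEGAL_HOLDS = ['txn-2041']

def archivable?(data_type, created_on, record_id, today: Date.today)
  return false if LEGAL_HOLDS.include?(record_id) # legal hold always wins
  policy = RETENTION_POLICIES.fetch(data_type)
  (today - created_on) > policy[:archive_after_days]
end

puts archivable?('application_log', Date.today - 60, 'log-1')       # true
puts archivable?('financial_record', Date.today - 3000, 'txn-2041') # false: legal hold
```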
Data classification categorizes data based on access frequency, business value, and regulatory requirements. Hot data requires immediate access and remains in production systems. Warm data sees occasional access and may reside in near-line storage. Cold data rarely sees access and moves to low-cost archival storage. Deep archive stores data that may never be accessed but must be retained for compliance.
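These tiers can be made concrete with a small classifier; the age thresholds below are illustrative assumptions, not fixed standards:

```ruby
require 'date'

# Illustrative tier classification by record age. Real systems would also
# weigh access frequency and business value, not age alone.
def storage_tier(created_on, today: Date.today)
  age_days = (today - created_on).to_i
  case age_days
  when 0..90     then :hot          # production database
  when 91..365   then :warm         # near-line storage
  when 366..1095 then :cold         # low-cost archival storage
  else                :deep_archive # compliance-only retention
  end
end

puts storage_tier(Date.today - 10)   # :hot
puts storage_tier(Date.today - 400)  # :cold
```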
Access pattern consideration determines archival strategy viability. Data accessed frequently should not be archived regardless of age. Archival works best for data with predictable decline in access frequency over time. Applications must track access patterns to identify archival candidates and validate archival decisions don't harm user experience.
Data integrity preservation ensures archived data remains accurate and complete throughout its retention period. Archival processes must validate data integrity during transfer, prevent corruption in long-term storage, and verify data remains readable years after archival. This requires format stability, checksum verification, and periodic data validation.
Compliance alignment matches archival strategies to regulatory requirements. Many industries mandate specific retention periods, data formats, and access controls for archived data. Healthcare organizations must comply with HIPAA requirements for patient records. Financial institutions follow regulations specifying retention periods for transaction records. Archival strategies must support e-discovery requests, audit trails, and chain of custody documentation.
Performance impact minimization prevents archival operations from degrading production system performance. Bulk data transfers can saturate network bandwidth and overload storage systems. Archival processes typically run during low-traffic periods, use rate limiting, and process data in small batches. The archival process itself should not become a performance bottleneck.
Cost optimization balances storage costs against access requirements. Archival storage costs significantly less than production storage, but retrieval may incur fees and delays. Organizations must analyze the total cost including storage, retrieval, and operational overhead. Cloud providers offer multiple storage tiers with different cost-performance characteristics that inform strategy selection.
Restoration capability ensures archived data can be retrieved when needed. Archival systems must support both individual record retrieval and bulk restoration. Restoration time objectives define acceptable delays for data access. Critical data may require faster restoration than rarely accessed compliance records. The restoration process should validate data integrity and handle format conversions if necessary.
Design Considerations
Selecting an appropriate archival strategy requires analyzing multiple factors affecting implementation feasibility, operational overhead, and long-term viability. Different strategies suit different use cases, and organizations often implement multiple approaches for different data types.
Data access frequency patterns determine whether archival makes sense. If data older than a threshold sees less than 5% of total queries, archival likely improves performance and reduces costs. If old data receives frequent access, archival may hurt user experience by introducing retrieval delays. Organizations should analyze query logs to understand access patterns before implementing archival.
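A hedged sketch of that threshold check, fed by counts taken from query logs (the 5% cutoff mirrors the guideline above; the method name is illustrative):

```ruby
# Data older than the cutoff is an archival candidate only if it receives
# under 5% of total queries over the analysis window.
def archival_candidate?(old_query_count, total_query_count, threshold: 0.05)
  return false if total_query_count.zero?
  old_query_count.to_f / total_query_count < threshold
end

puts archival_candidate?(1_200, 100_000)  # true: 1.2% of queries hit old data
puts archival_candidate?(9_000, 100_000)  # false: 9% is too frequent to archive
```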
Compliance and regulatory requirements constrain archival options. Some regulations mandate specific storage formats, encryption requirements, or geographic restrictions. Financial services regulations may require archived data remain searchable within specific timeframes. Healthcare regulations may prohibit storing patient data in certain jurisdictions. The archival strategy must satisfy all applicable requirements while minimizing operational complexity.
Recovery time objectives specify how quickly archived data must be accessible. Some use cases tolerate hours or days for data retrieval, making cold storage viable. Other scenarios require near-instant access to archived data, necessitating more expensive storage solutions. Customer-facing features typically demand faster restoration than internal audit functions.
Storage cost versus access cost trade-offs affect strategy selection. Cloud providers charge differently for storage and retrieval. Amazon Glacier offers very low storage costs but charges for data retrieval and requires hours for access. S3 Standard costs more for storage but provides immediate access. Analyzing access frequency helps determine the optimal storage tier. Infrequently accessed data benefits from cheaper storage despite retrieval costs.
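This trade-off can be quantified with a simple monthly-cost comparison; the per-GB rates below are illustrative placeholders, not current provider pricing:

```ruby
# Total monthly cost = storage charge + retrieval charge.
def monthly_cost(gb_stored:, gb_retrieved:, storage_rate:, retrieval_rate:)
  gb_stored * storage_rate + gb_retrieved * retrieval_rate
end

# 1 TB stored, 50 GB retrieved per month (rates are examples only).
standard = monthly_cost(gb_stored: 1000, gb_retrieved: 50,
                        storage_rate: 0.023, retrieval_rate: 0.0)
glacier  = monthly_cost(gb_stored: 1000, gb_retrieved: 50,
                        storage_rate: 0.0036, retrieval_rate: 0.03)

puts standard.round(2) # 23.0
puts glacier.round(2)  # 5.1 -- cheaper despite retrieval fees at this access rate
```

At much higher retrieval volumes the comparison flips, which is why access frequency analysis precedes tier selection.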
Data volume and growth rate impact archival frequency and batch sizes. High-volume systems may need daily archival to prevent database growth from affecting performance. Lower-volume systems might archive monthly or quarterly. Extremely large datasets require careful attention to transfer times and bandwidth constraints. The archival process must keep pace with data generation to prevent unbounded growth.
Query complexity for archived data affects strategy viability. Simple key-based lookups work well with most archival strategies. Complex analytical queries spanning archived and active data require either expensive cross-storage queries or strategies that preserve query capability after archival. Time-series databases often partition data by time period, allowing queries to target specific partitions efficiently.
Format longevity and migration concerns grow important for multi-year retention. File formats and database schemas evolve, potentially making old archived data unreadable. The archival strategy should either use stable formats with long-term guarantees or include periodic migration to current formats. JSON and CSV formats offer better long-term stability than proprietary binary formats.
Integration with existing infrastructure determines implementation complexity. Archival strategies requiring significant architectural changes may not be feasible for legacy systems. The ideal strategy works with existing tools, workflows, and permissions systems. Cloud-native applications can more easily adopt cloud storage for archival than applications tightly coupled to on-premises infrastructure.
Implementation Approaches
Data archival strategies vary in complexity, cost, and suitability for different scenarios. Organizations often combine multiple approaches for different data types or access patterns.
Time-based partitioning divides data into separate partitions based on creation date or timestamp. Database tables partition by month or year, with older partitions archived separately. This approach simplifies archival by allowing entire partitions to move to archival storage. Queries can target specific partitions based on date ranges, maintaining performance. PostgreSQL and MySQL support table partitioning natively, making implementation straightforward. Time-based partitioning works well for time-series data, logs, and transactions with temporal characteristics.
Separate archival database maintains a dedicated database instance for archived data. Active data remains in the production database while archival processes periodically move old records to the archival database. This approach preserves data structure and query capability while isolating performance impact. Applications can query the archival database when needed using the same database driver and query language. The archival database can run on cheaper hardware or in a different geographic region. This strategy suits applications where occasional queries need to span historical data but most queries target recent data.
Object storage migration moves archived data from databases to object storage like Amazon S3 or Azure Blob Storage. Data exports to files in formats like JSON, CSV, or Parquet, then uploads to object storage. Object storage costs significantly less than database storage and scales to petabytes. However, data loses queryability without additional tooling. This approach works well for compliance archival where data rarely needs retrieval. When access is needed, applications download files from object storage and process them locally.
Data warehouse consolidation moves archived data to a separate data warehouse optimized for analytical queries. Transactional databases archive old data to analytical systems like Amazon Redshift, Google BigQuery, or Snowflake. The data warehouse provides efficient querying across large historical datasets while removing load from transactional systems. This strategy suits organizations already operating data warehouses and needing analytical access to historical data.
Hybrid tiered storage combines multiple storage systems in a hierarchy. Recent data resides in fast databases, warm data moves to slower database instances, and cold data migrates to object storage. Applications query the appropriate tier based on data age. This approach optimizes cost by matching storage costs to access patterns. Implementation complexity increases due to managing multiple storage systems and query routing logic.
Database table archival with compression keeps archived data in the same database but moves it to separate tables with compression enabled. MySQL's InnoDB supports compressed row formats, and PostgreSQL compresses large column values through TOAST; text-heavy data often shrinks by 70-90%. Archived tables may live in separate tablespaces on slower storage. This approach maintains query capability while reducing storage costs and production table size. The strategy works well when database migration is impractical but performance improvement is needed.
Microservice-based archival service implements archival as a dedicated service that handles archival for multiple applications. The service provides APIs for archiving, searching, and retrieving data. This centralizes archival logic, improves consistency across applications, and allows specialized optimization. The archival service can manage multiple storage backends, handle format conversions, and enforce retention policies uniformly. This approach suits microservice architectures where multiple services generate data requiring archival.
Ruby Implementation
Ruby applications implement data archival through various libraries and patterns. ActiveRecord provides database-level capabilities while background job frameworks handle the asynchronous nature of archival operations.
Basic ActiveRecord archival moves records from production tables to archival tables. This approach maintains data structure while separating active and archived data.
class Order < ApplicationRecord
  def self.archive_old_orders(cutoff_date)
    old_orders = where("created_at < ?", cutoff_date)
    old_orders.find_each(batch_size: 1000) do |order|
      # Copy and delete atomically so a failure cannot drop a record
      transaction do
        ArchivedOrder.create!(order.attributes.except('id'))
        order.destroy!
      end
    end
  end
end

class ArchivedOrder < ApplicationRecord
  # Mirror of the Order model backed by the archival table
end
Partition-based archival uses PostgreSQL table partitioning through ActiveRecord migrations. This allows entire partitions to be detached for archival.
class CreatePartitionedOrders < ActiveRecord::Migration[7.0]
  def up
    execute <<-SQL
      CREATE TABLE orders (
        id bigserial,
        created_at timestamp NOT NULL,
        -- other columns
        PRIMARY KEY (id, created_at)
      ) PARTITION BY RANGE (created_at);

      CREATE TABLE orders_2024 PARTITION OF orders
        FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
      CREATE TABLE orders_2023 PARTITION OF orders
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
    SQL
  end
end

class ArchivalService
  def archive_partition(year)
    ActiveRecord::Base.connection.execute(
      "ALTER TABLE orders DETACH PARTITION orders_#{year}"
    )
    # Move the detached partition to archival storage
    dump_partition(year) # e.g. export the detached table with pg_dump
    upload_to_s3(year)
    drop_partition(year)
  end
end
S3 archival with AWS SDK exports data to JSON and uploads to S3 for cost-effective long-term storage.
require 'aws-sdk-s3'
require 'zlib'
require 'json'
require 'stringio'

class S3ArchivalService
  def initialize
    @s3_client = Aws::S3::Client.new(
      region: ENV['AWS_REGION'],
      access_key_id: ENV['AWS_ACCESS_KEY_ID'],
      secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
    )
    @bucket = ENV['ARCHIVAL_BUCKET']
  end

  def archive_orders(start_date, end_date)
    orders = Order.where(created_at: start_date..end_date)
    file_path = "/tmp/orders_#{start_date}_#{end_date}.json.gz"

    Zlib::GzipWriter.open(file_path) do |gz|
      orders.find_each do |order|
        gz.write(order.to_json)
        gz.write("\n")
      end
    end

    key = "orders/#{start_date.year}/#{File.basename(file_path)}"
    File.open(file_path, 'rb') do |file|
      @s3_client.put_object(
        bucket: @bucket,
        key: key,
        body: file, # stream the file instead of loading it all into memory
        storage_class: 'GLACIER_IR' # Glacier Instant Retrieval
      )
    end
    File.delete(file_path)
    orders.delete_all
    key
  end

  def retrieve_archived_orders(key)
    response = @s3_client.get_object(bucket: @bucket, key: key)
    orders = []
    Zlib::GzipReader.new(StringIO.new(response.body.read)).each_line do |line|
      orders << JSON.parse(line)
    end
    orders
  end
end
Background job archival with Sidekiq processes archival asynchronously to avoid blocking application threads.
class ArchiveOldDataJob
  include Sidekiq::Job
  sidekiq_options queue: :low_priority, retry: 3

  def perform(model_name, cutoff_date)
    model = model_name.constantize
    cutoff = Time.zone.parse(cutoff_date) # Sidekiq arguments arrive as JSON strings
    batch_size = 1000

    loop do
      batch = model.where("created_at < ?", cutoff).limit(batch_size)
      break if batch.empty?
      batch.each do |record|
        archive_record(model_name, record)
      end
      sleep(1) # Rate limiting between batches
    end
  end

  private

  def archive_record(model_name, record)
    archival_model = "Archived#{model_name}".constantize
    archival_model.create!(record.attributes.except('id'))
    record.destroy
  end
end

# Schedule monthly archival
class ArchivalScheduler
  def self.schedule_monthly_archival
    cutoff_date = 6.months.ago.iso8601 # pass a string; Sidekiq serializes args as JSON
    ['Order', 'Payment', 'ActivityLog'].each do |model_name|
      ArchiveOldDataJob.perform_async(model_name, cutoff_date)
    end
  end
end
Database dump archival exports filtered table data to compressed files for offline storage.
class DatabaseDumpArchival
  def archive_table_data(table_name, start_date, end_date)
    timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
    filename = "#{table_name}_#{start_date}_#{end_date}_#{timestamp}.csv.gz"
    filepath = "/tmp/#{filename}"

    # pg_dump cannot filter rows, so export the date range with COPY instead
    # (credentials are expected via PGPASSWORD or ~/.pgpass)
    copy_sql = "\\copy (SELECT * FROM #{table_name} " \
               "WHERE created_at >= '#{start_date}' AND created_at < '#{end_date}') " \
               "TO STDOUT WITH CSV HEADER"
    command = <<~CMD
      psql --host=#{db_config[:host]} --username=#{db_config[:username]} \
        --dbname=#{db_config[:database]} --command="#{copy_sql}" | gzip > #{filepath}
    CMD
    raise "export of #{table_name} failed" unless system(command)

    # Upload to archival storage
    upload_to_archival_storage(filepath, filename)

    # Delete archived data from production only after a successful export
    ActiveRecord::Base.connection.execute(
      "DELETE FROM #{table_name} WHERE created_at >= '#{start_date}' AND created_at < '#{end_date}'"
    )
    File.delete(filepath)
  end

  private

  def db_config
    # configuration_hash returns symbol keys in Rails 6.1+
    ActiveRecord::Base.connection_db_config.configuration_hash
  end

  def upload_to_archival_storage(filepath, filename)
    # Upload to S3, NFS, or other archival storage
  end
end
Archival with checksum verification ensures data integrity during archival by verifying checksums before deletion.
require 'digest'

class VerifiedArchivalService
  def archive_with_verification(records, storage_adapter)
    manifest = {
      archived_at: Time.now,
      record_count: records.count,
      checksums: []
    }

    records.each do |record|
      json_data = record.to_json
      checksum = Digest::SHA256.hexdigest(json_data)
      storage_adapter.write(record.id, json_data)
      manifest[:checksums] << { id: record.id, checksum: checksum }
    end

    # Verify every stored record before deleting anything from production
    if verify_archived_data(manifest, storage_adapter)
      records.each(&:destroy)
      storage_adapter.write_manifest(manifest)
      { success: true, archived_count: records.count }
    else
      { success: false, error: 'Verification failed' }
    end
  end

  private

  def verify_archived_data(manifest, storage_adapter)
    manifest[:checksums].all? do |entry|
      stored_data = storage_adapter.read(entry[:id])
      Digest::SHA256.hexdigest(stored_data) == entry[:checksum]
    end
  end
end
Multi-tier archival service implements progressive archival through multiple storage tiers based on data age.
class MultiTierArchivalService
  TIERS = {
    warm:   { age: 90.days, storage: :fast_db },
    cold:   { age: 1.year,  storage: :s3_standard },
    frozen: { age: 3.years, storage: :s3_glacier }
  }

  def process_tiering
    TIERS.each do |tier_name, config|
      cutoff_date = config[:age].ago
      next_tier = next_tier_for(tier_name)
      move_to_tier(tier_name, next_tier, cutoff_date) if next_tier
    end
  end

  private

  def move_to_tier(current_tier, next_tier, cutoff_date)
    records = fetch_records_for_tier(current_tier, cutoff_date)
    records.find_each(batch_size: 500) do |record|
      storage_adapter_for(next_tier).store(record)
      storage_adapter_for(current_tier).delete(record.id)
    end
  end

  def next_tier_for(tier_name)
    tier_order = TIERS.keys
    current_index = tier_order.index(tier_name)
    tier_order[current_index + 1] # nil for the last tier
  end

  def storage_adapter_for(tier_name)
    case TIERS[tier_name][:storage]
    when :fast_db
      DatabaseStorageAdapter.new
    when :s3_standard
      S3StorageAdapter.new(storage_class: 'STANDARD')
    when :s3_glacier
      S3StorageAdapter.new(storage_class: 'GLACIER')
    end
  end
end
Performance Considerations
Data archival decisions significantly affect system performance across multiple dimensions. Understanding performance implications helps design archival strategies that improve production system responsiveness while maintaining data accessibility.
Database query performance improves dramatically when table sizes decrease through archival. Database indexes perform better on smaller tables, query planners generate more efficient execution plans, and table scans complete faster. A query that scans a 100-million-row table in 50ms may complete in 10ms after 80% of the records are archived. The performance benefit compounds as data continues growing: without archival, query times increase linearly or worse with table size.
Index maintenance overhead decreases with smaller tables. Every insert, update, or delete on indexed columns requires index updates. Large indexes consume more memory and disk I/O. Archiving old data reduces index size, making index updates faster and freeing cache memory for frequently accessed data. Applications seeing high write volumes benefit most from this reduction.
Backup and restore times decrease proportionally to database size reduction. A database backup taking six hours might complete in two hours after archiving. Restore operations similarly benefit. Faster backups and restores improve disaster recovery objectives and reduce maintenance windows. Archival also reduces backup storage costs since smaller databases generate smaller backup files.
Archival operation performance requires careful management to avoid impacting production workloads. Bulk deletion of archived records can lock tables, preventing concurrent access. Breaking archival into small batches with delays between batches spreads the load over time. Performing archival during low-traffic periods minimizes user impact. Network bandwidth becomes a constraint when transferring large data volumes to remote archival storage.
Data transfer bottlenecks emerge when moving large datasets. Transferring terabytes of data to cloud storage may saturate network connections or consume transfer quotas. Compression before transfer reduces bandwidth requirements but adds CPU overhead. Estimating transfer times helps schedule archival operations appropriately. A 1TB dataset over a 1Gbps connection requires minimum 2.5 hours without considering protocol overhead.
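The transfer estimate above can be reproduced with a short calculation (protocol overhead is ignored, so real transfers run longer):

```ruby
# Back-of-envelope transfer time: bytes * 8 bits over link speed in bits/sec.
def transfer_hours(bytes, link_bps)
  (bytes * 8.0 / link_bps) / 3600
end

one_tb = 10**12 # 1 TB as 10^12 bytes
puts transfer_hours(one_tb, 10**9).round(2) # 2.22 hours on a 1 Gbps link
```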
Storage I/O patterns affect archival performance differently across storage types. Object storage handles large sequential writes efficiently but struggles with small random writes. Database storage performs better with small transactions but may slow with large bulk operations. Archival strategies should match their I/O patterns to storage characteristics. Writing to object storage in large batches performs better than individual record writes.
Retrieval performance trade-offs vary by storage tier. Amazon S3 Standard provides millisecond retrieval while Glacier Deep Archive requires 12-48 hours. The cost savings of cheaper storage must justify retrieval delays for the use case. Applications needing occasional access to archived data should avoid ultra-cold storage tiers. Understanding retrieval SLAs prevents choosing archival strategies incompatible with business requirements.
Query performance across archived and active data presents challenges. Queries spanning active and archived data may need to query multiple systems and merge results. This increases latency and complexity compared to single-system queries. Materializing archived data back to production databases for analysis defeats archival purposes. Data warehouse solutions mitigate this by providing unified query interfaces across storage tiers.
Compression reduces storage costs and transfer times but adds CPU overhead. Compressing archived data before storage decreases costs by 60-90% for text data. Decompression during retrieval adds latency. Applications should balance compression ratio against CPU cost. Streaming compression during archival transfers avoids storing uncompressed temporary files.
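A minimal sketch of streaming compression using Ruby's standard zlib library: records are compressed as they are produced, so no uncompressed temporary file is written and memory stays bounded.

```ruby
require 'zlib'
require 'stringio'

# Synthetic newline-delimited JSON records standing in for archival data.
records = (0...1000).map { |i| "{\"id\":#{i},\"status\":\"archived\"}\n" }
original_size = records.sum(&:bytesize)

# Stream each record straight into the gzip writer.
buffer = StringIO.new
gz = Zlib::GzipWriter.new(buffer)
records.each { |r| gz.write(r) }
gz.close

compressed = buffer.string
restored = Zlib::GzipReader.new(StringIO.new(compressed)).read

puts compressed.bytesize < original_size # repetitive text shrinks substantially
puts restored.bytesize == original_size  # round-trip is lossless
```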
Connection pool saturation can occur during archival operations if too many concurrent database connections perform archival queries. Archival processes should use separate connection pools from application connections or limit concurrent archival workers. Saturating connection pools degrades application performance during archival windows.
Memory consumption during archival requires monitoring. Loading large result sets into memory for processing can exhaust available RAM. Streaming approaches that process records individually or in small batches prevent memory exhaustion. Ruby's find_each method in ActiveRecord provides efficient batch processing that maintains constant memory usage regardless of dataset size.
Tools & Ecosystem
Multiple tools and services facilitate data archival implementation. Ruby applications can integrate these tools through libraries and APIs to build comprehensive archival solutions.
AWS S3 and Glacier provide object storage with multiple tiers for different access patterns. S3 Standard offers immediate access, S3 Infrequent Access reduces costs for less frequent access, Glacier Flexible Retrieval provides low-cost storage with hours-long retrieval, and Glacier Deep Archive offers the lowest cost with 12-48 hour retrieval. The aws-sdk-s3 gem provides Ruby integration.
require 'aws-sdk-s3'

client = Aws::S3::Client.new(region: 'us-east-1')

# Upload directly to an infrequent-access storage class
client.put_object(
  bucket: 'archival-bucket',
  key: 'orders/2020/data.json.gz',
  body: compressed_data,
  storage_class: 'STANDARD_IA'
)

# Configure bucket lifecycle transitions
client.put_bucket_lifecycle_configuration(
  bucket: 'archival-bucket',
  lifecycle_configuration: {
    rules: [{
      id: 'archive-old-data',
      status: 'Enabled',
      filter: { prefix: '' }, # lifecycle rules require a filter or prefix
      transitions: [
        { days: 90, storage_class: 'GLACIER' },
        { days: 365, storage_class: 'DEEP_ARCHIVE' }
      ]
    }]
  }
)
PostgreSQL table partitioning manages data archival through declarative partitioning. Tables partition by range, list, or hash. Partitions can be detached for archival or dropped entirely. The pg gem provides Ruby interface to PostgreSQL-specific features.
MySQL's ARCHIVE storage engine provides transparent compression for archived tables. It compresses rows as they are inserted and supports only INSERT and SELECT (no UPDATE or DELETE), making it suitable for append-only archival data.
Sidekiq and delayed_job provide background job processing for asynchronous archival operations. These libraries handle job scheduling, retry logic, and monitoring, preventing archival from blocking application threads.
Logrotate manages log file archival on Linux systems. Configuration files specify rotation frequency, compression, and retention periods. Applications writing to log files benefit from logrotate's automatic archival.
# /etc/logrotate.d/rails_app
/var/log/rails_app/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 0640 rails rails
    sharedscripts
    postrotate
        systemctl reload rails_app
    endscript
}
Elasticsearch curator manages time-series index archival in Elasticsearch clusters. Curator can close, delete, or snapshot old indices based on age or size criteria. This prevents Elasticsearch clusters from growing unbounded with historical data.
Apache Parquet provides a columnar storage format efficient for analytical workloads. Archiving data to Parquet enables efficient querying in data warehouses while achieving high compression ratios. Ruby support is available through gems such as red-parquet, part of the Apache Arrow project.
SQLite for archival databases works well for smaller archival datasets requiring occasional queries. Each archived period can store in separate SQLite files, providing simple file-based archival with full SQL query capability.
require 'sqlite3'

def create_archival_database(year)
  db = SQLite3::Database.new("orders_archive_#{year}.db")
  db.execute <<-SQL
    CREATE TABLE orders (
      id INTEGER PRIMARY KEY,
      created_at TEXT,
      customer_id INTEGER,
      total_amount TEXT,
      data TEXT
    )
  SQL
  db.execute "CREATE INDEX idx_created_at ON orders(created_at)"
  db
end

def archive_to_sqlite(year)
  archive_db = create_archival_database(year)
  # Use a date range so the filter runs on the production database
  # (strftime would be SQLite-specific and fail on PostgreSQL)
  range = Date.new(year, 1, 1)...Date.new(year + 1, 1, 1)
  Order.where(created_at: range).find_each(batch_size: 1000) do |order|
    archive_db.execute(
      "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
      [order.id, order.created_at.iso8601, order.customer_id,
       order.total_amount.to_s, order.to_json] # bind SQLite-friendly types
    )
  end
  archive_db.close
end
Database backup tools like pg_dump and mysqldump export data for archival. These tools support per-table selection (and mysqldump supports row filtering via --where), compression, and parallel export. The backed-up files serve as archival storage.
Cloud provider lifecycle policies automate storage tier transitions. AWS S3, Google Cloud Storage, and Azure Blob Storage support lifecycle rules that automatically move objects between storage tiers based on age or access patterns. This reduces operational overhead for multi-tier archival.
Data warehouse solutions like Amazon Redshift, Snowflake, and Google BigQuery provide archival storage with analytical query capability. These systems optimize for infrequent writes and complex analytical queries across large datasets.
Compression libraries reduce archival storage requirements. The zlib, bzip2-ffi, and lz4-ruby gems provide different compression algorithms with varying trade-offs between compression ratio, speed, and CPU usage. Gzip typically reduces text data by 70-90%, with repetitive content such as logs compressing furthest.
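A quick sketch of the ratio-versus-speed trade-off using the standard zlib library on synthetic log text:

```ruby
require 'zlib'

# Higher compression levels shrink output more but cost more CPU time.
data = "2024-01-15 INFO order archived id=12345\n" * 5000

fast = Zlib::Deflate.deflate(data, Zlib::BEST_SPEED)
best = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)

puts best.bytesize <= fast.bytesize     # the higher level compresses at least as well here
puts data.bytesize / best.bytesize > 10 # repetitive log text compresses dramatically
```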
Monitoring tools track archival operation health. Prometheus and Datadog can monitor archival job success rates, processing times, data volumes, and errors. Alerting on archival failures prevents compliance issues from unnoticed archival problems.
Reference
Archival Strategy Comparison
| Strategy | Access Speed | Cost | Query Capability | Complexity |
|---|---|---|---|---|
| Archival table | Fast | Medium | Full SQL | Low |
| Separate database | Fast | Medium | Full SQL | Medium |
| Object storage | Slow | Low | None | Medium |
| Data warehouse | Medium | Medium | Analytical | High |
| Compressed dump | Very slow | Very low | None | Low |
| Hybrid tiered | Variable | Low | Limited | High |
Storage Tier Characteristics
| Tier | Retrieval Time | Cost per GB/month | Use Case |
|---|---|---|---|
| S3 Standard | Immediate | $0.023 | Warm data |
| S3 Standard-IA | Immediate | $0.0125 | Infrequent access |
| S3 Glacier IR | Immediate | $0.004 | Archival with instant access |
| S3 Glacier Flexible | 1-5 hours | $0.0036 | Rarely accessed archival |
| S3 Deep Archive | 12-48 hours | $0.00099 | Long-term compliance |
Archival Decision Matrix
| Data Age | Access Frequency | Recommended Strategy |
|---|---|---|
| < 90 days | High | Production database |
| 90-365 days | Medium | Separate database or warm storage |
| 1-3 years | Low | Object storage or cold tier |
| > 3 years | Very low | Deep archive or tape |
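The matrix above can be expressed as a small helper function; the 100-queries-per-month cutoff for "high" access frequency is an illustrative assumption:

```ruby
# Map data age and access frequency to a strategy, mirroring the table above.
def recommended_strategy(age_days, accesses_per_month)
  return :production_database if age_days < 90 || accesses_per_month > 100
  return :separate_database   if age_days < 365
  return :object_storage      if age_days < 3 * 365
  :deep_archive
end

puts recommended_strategy(30, 500)  # :production_database
puts recommended_strategy(200, 10)  # :separate_database
puts recommended_strategy(2000, 1)  # :deep_archive
```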
Common Retention Policies
| Industry | Data Type | Typical Retention |
|---|---|---|
| Financial | Transaction records | 7 years |
| Healthcare | Patient records | 6 years after last visit |
| E-commerce | Order history | 3-7 years |
| SaaS | User activity logs | 90 days to 1 year |
| Government | Public records | Permanent |
ActiveRecord Archival Methods
| Method | Purpose | Performance Impact |
|---|---|---|
| find_each | Batch iteration | Low memory usage |
| delete_all | Bulk deletion | Fast but no callbacks |
| destroy_all | Individual deletion | Slow with callbacks |
| insert_all | Bulk insertion | Fast without callbacks |
| transaction | Atomic operations | Ensures consistency |
PostgreSQL Partition Commands
| Command | Purpose |
|---|---|
| CREATE TABLE ... PARTITION BY RANGE | Create partitioned table |
| CREATE TABLE ... PARTITION OF | Create partition |
| ALTER TABLE DETACH PARTITION | Remove partition from parent |
| ALTER TABLE ATTACH PARTITION | Add existing table as partition |
| DROP TABLE | Delete detached partition |
Compression Algorithm Comparison
| Algorithm | Compression Ratio | Speed | CPU Usage | Gem |
|---|---|---|---|---|
| Gzip | 70-80% | Medium | Medium | zlib |
| Bzip2 | 80-90% | Slow | High | bzip2-ffi |
| LZ4 | 50-60% | Very fast | Low | lz4-ruby |
| Zstd | 70-80% | Fast | Low | zstd-ruby |
Archival Process Checklist
| Step | Verification |
|---|---|
| Define retention policy | Document business and compliance requirements |
| Identify archival candidates | Analyze access patterns and data age |
| Choose storage tier | Match cost and access needs |
| Implement archival logic | Write and test archival code |
| Verify data integrity | Checksum validation before deletion |
| Test restoration | Ensure archived data retrievable |
| Monitor and alert | Track success rates and failures |
| Document procedures | Maintain runbooks for operations |
S3 Lifecycle Configuration
| Parameter | Description | Example Value |
|---|---|---|
| ID | Rule identifier | archive-policy-001 |
| Status | Rule state | Enabled |
| Prefix | Object path filter | orders/2020/ |
| Transitions | Storage class changes | 90 days to Glacier |
| Expiration | Deletion timing | 2555 days |
| NoncurrentVersionExpiration | Delete old versions | 30 days |
Ruby Archival Gems
| Gem | Purpose | Features |
|---|---|---|
| aws-sdk-s3 | S3 integration | Multi-tier storage, lifecycle policies |
| pg | PostgreSQL driver | Partition support, native types |
| sqlite3 | SQLite integration | File-based archival databases |
| sidekiq | Background jobs | Async archival processing |
| zlib | Compression | Gzip compression/decompression |
| parallel | Parallel processing | Multi-threaded archival |
Monitoring Metrics
| Metric | Purpose | Alert Threshold |
|---|---|---|
| Archival success rate | Track failures | < 95% |
| Processing time | Detect slowdowns | > 2x baseline |
| Data volume archived | Capacity planning | Monitor trends |
| Storage costs | Budget management | > threshold |
| Restoration time | SLA compliance | > defined SLA |
| Error count | Operational health | > 5 per hour |