Overview
Backup and recovery strategies form the operational foundation for data protection in software systems. A backup creates a copy of data at a specific point in time, while recovery restores that data when the original becomes corrupted, deleted, or inaccessible. The strategy defines how often backups occur, what data gets backed up, where backups are stored, and how quickly systems can recover.
Organizations lose data through hardware failures, software bugs, human error, malicious attacks, and natural disasters. Without backups, data loss results in business disruption, regulatory penalties, and potential business failure. The cost of data loss often exceeds the cost of maintaining backups by orders of magnitude.
Backup strategies balance multiple objectives: minimizing data loss (measured by Recovery Point Objective or RPO), minimizing downtime (measured by Recovery Time Objective or RTO), managing storage costs, and maintaining backup integrity. Different systems require different strategies based on their criticality, data volume, and change frequency.
A database handling financial transactions might need continuous replication with sub-second RPO and RTO, while archived log files might tolerate daily backups with hours of recovery time. The strategy must account for data size, network bandwidth, backup windows, and regulatory retention requirements.
# Basic backup configuration example
backup_config = {
  source: '/var/app/data',
  destination: 's3://backups/app-data',
  frequency: '0 2 * * *', # 2 AM daily
  retention: 30,          # days
  compression: true,
  encryption: true
}
Recovery scenarios include complete system restoration after catastrophic failure, point-in-time recovery to undo changes, selective file restoration, and recovery testing to verify backup integrity. Each scenario requires different capabilities from the backup system.
Key Principles
Recovery Point Objective (RPO) defines the maximum acceptable data loss measured in time. An RPO of 1 hour means the system can tolerate losing up to 1 hour of data. RPO determines backup frequency: a 1-hour RPO requires backups at least hourly. Systems with zero RPO need continuous replication or synchronous writes to multiple locations.
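As a back-of-the-envelope sketch (the helper name is hypothetical), the RPO directly bounds the backup interval:

```ruby
# Hypothetical helper: an RPO is an upper bound on the interval between
# backups, since worst-case loss equals the time since the last backup.
def backups_per_day(rpo_seconds)
  (24 * 3600.0 / rpo_seconds).ceil
end

backups_per_day(3600)      # 1-hour RPO: at least 24 backups per day
backups_per_day(6 * 3600)  # 6-hour RPO: at least 4 backups per day
```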
Recovery Time Objective (RTO) defines the maximum acceptable downtime. An RTO of 4 hours means the system must be operational within 4 hours of failure. RTO influences backup format, storage location, and recovery automation. Shorter RTOs require faster storage, more automation, and potentially hot standby systems.
Backup types differ in what data they capture and how long they take:
Full backups copy all data regardless of whether it changed. They provide complete system state and simplify recovery but require the most storage and time. A full backup might take 8 hours for a 10TB database.
Incremental backups copy only data that changed since the last backup of any type. They minimize backup time and storage but complicate recovery because restoration requires the last full backup plus all incremental backups since then. Recovering from incremental backups takes longer due to processing multiple backup sets.
Differential backups copy data that changed since the last full backup. They balance storage efficiency and recovery speed: recovery requires only the last full backup plus the most recent differential backup.
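The trade-off between the three types shows up at restore time. A minimal sketch (the data shapes are hypothetical) of which backup sets each strategy must read back:

```ruby
# Hypothetical sketch: backups are hashes with :type (:full, :incremental,
# or :differential) and :seq (creation order).
def restore_chain(backups, strategy)
  last_full_idx = backups.rindex { |b| b[:type] == :full }
  full = backups[last_full_idx]
  case strategy
  when :full
    [full]
  when :incremental
    # Full backup plus every incremental taken after it, in order.
    [full] + backups[(last_full_idx + 1)..].select { |b| b[:type] == :incremental }
  when :differential
    # Full backup plus only the most recent differential.
    diff = backups[(last_full_idx + 1)..].select { |b| b[:type] == :differential }.last
    diff ? [full, diff] : [full]
  end
end
```

A chain of one full plus three incrementals needs all four sets; the same history with differentials needs only the full and the latest differential.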
The 3-2-1 rule provides a foundational strategy: maintain 3 copies of data (1 primary + 2 backups), store backups on 2 different media types, keep 1 backup copy offsite. This protects against multiple failure scenarios including hardware failure, site disasters, and ransomware attacks.
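The rule can be expressed as a simple checklist over a backup plan (the data shape here is hypothetical):

```ruby
# Hypothetical 3-2-1 check: each copy records its media type and whether
# it is stored offsite.
def satisfies_3_2_1?(copies)
  copies.size >= 3 &&
    copies.map { |c| c[:media] }.uniq.size >= 2 &&
    copies.any? { |c| c[:offsite] }
end

plan = [
  { media: :disk,  offsite: false }, # primary data
  { media: :disk,  offsite: false }, # local backup
  { media: :cloud, offsite: true }   # offsite backup
]
satisfies_3_2_1?(plan) # => true
```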
Backup integrity verification confirms backups are usable. Verification includes checksum validation during backup, restoration testing to confirm data can be recovered, and corruption detection through periodic validation. An untested backup is not a backup.
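Checksum validation is the cheapest of these checks. A minimal sketch (helper names hypothetical):

```ruby
require 'digest'
require 'tempfile'

# Record a SHA-256 digest at backup time, then compare before restoring.
def record_checksum(path)
  Digest::SHA256.file(path).hexdigest
end

def backup_intact?(path, recorded)
  Digest::SHA256.file(path).hexdigest == recorded
end

file = Tempfile.new('backup')
file.write('backup contents')
file.flush
checksum = record_checksum(file.path)
backup_intact?(file.path, checksum) # => true
```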
Retention policies define how long backups are kept. Regulatory requirements often mandate specific retention periods: financial records might need 7 years, healthcare data might need 10 years. Retention balances storage costs against recovery needs and compliance requirements. Many strategies use a tiered approach: daily backups for 30 days, weekly backups for 6 months, monthly backups for 7 years.
# Retention policy implementation
require 'date'

class RetentionPolicy
  def initialize
    @daily_retention = 30
    @weekly_retention = 180
    @monthly_retention = 2555 # ~7 years
  end

  def should_keep?(backup)
    age_days = (Date.today - backup.date).to_i
    return true if age_days < @daily_retention
    return true if backup.weekly? && age_days < @weekly_retention
    return true if backup.monthly? && age_days < @monthly_retention
    false
  end
end
Consistency requirements ensure backups represent valid system state. Application-level consistency captures data at a transactionally consistent point. Crash consistency captures filesystem state as it exists at backup time without application coordination. Database backups often require quiescing writes or using snapshot mechanisms that maintain transaction consistency.
Backup windows define acceptable times for backup operations that might impact performance. A backup window of 2 AM to 6 AM provides 4 hours for backup operations without affecting business hours. Systems with continuous operation require backup methods that don't require downtime, such as continuous replication or snapshot-based backups.
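A sketch of the window check a scheduler might perform before starting an intensive job (the helper name is hypothetical):

```ruby
require 'time'

# Returns true when the given time falls inside a 2 AM-6 AM backup window.
def in_backup_window?(time, start_hour: 2, end_hour: 6)
  time.hour >= start_hour && time.hour < end_hour
end

in_backup_window?(Time.parse('2024-06-01 03:30')) # => true
in_backup_window?(Time.parse('2024-06-01 09:00')) # => false
```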
Implementation Approaches
Full backup strategy copies all data on each backup operation. This approach simplifies recovery because restoration requires only a single backup set. Full backups work well for small datasets, systems with infrequent changes, or scenarios where storage costs are minimal.
Implementation creates a complete copy at each interval:
require 'fileutils'

class FullBackupStrategy
  def initialize(source, destination)
    @source = source
    @destination = destination
  end

  def backup
    timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
    backup_path = "#{@destination}/full_#{timestamp}"
    # Copy entire source to timestamped backup location
    FileUtils.cp_r(@source, backup_path)
    create_manifest(backup_path) # manifest/verification helpers not shown
    verify_backup(backup_path)
    backup_path
  end

  def restore(backup_path, restore_location)
    FileUtils.cp_r(backup_path, restore_location)
    verify_restore(restore_location)
  end
end
Incremental backup strategy copies only files that changed since the last backup of any type. After an initial full backup, subsequent backups capture only changes. This minimizes backup time and storage but increases recovery complexity.
Recovery requires the base full backup plus all incremental backups in sequence. A corrupted incremental backup in the chain breaks recovery for all subsequent backups.
require 'fileutils'

class IncrementalBackupStrategy
  def initialize(source, destination, state_file)
    @source = source
    @destination = destination
    @state_file = state_file
    @last_backup_time = load_last_backup_time # state persistence not shown
  end

  def backup
    timestamp = Time.now.to_i
    backup_path = "#{@destination}/incr_#{timestamp}"
    changed_files = find_changed_files(@source, @last_backup_time)
    copy_files(changed_files, backup_path)
    save_backup_time(timestamp)
    create_manifest(backup_path, changed_files)
    backup_path
  end

  def find_changed_files(dir, since_time)
    Dir.glob("#{dir}/**/*").select do |file|
      File.file?(file) && File.mtime(file).to_i > since_time
    end
  end

  def restore(full_backup, incremental_backups, restore_location)
    # Restore full backup first
    FileUtils.cp_r(full_backup, restore_location)
    # Apply incremental backups in order
    incremental_backups.sort.each do |incr_backup|
      apply_incremental(incr_backup, restore_location)
    end
  end
end
Differential backup strategy copies all files that changed since the last full backup. Each differential backup grows larger over time but recovery requires only the full backup plus the most recent differential.
Differential backups balance storage efficiency and recovery simplicity. They take longer than incremental backups but provide faster recovery and more resilience to backup corruption.
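Following the pattern of the full and incremental classes above, a differential strategy might look like this sketch (manifest and verification steps omitted; file-copying logic is illustrative):

```ruby
require 'fileutils'

# Hypothetical DifferentialBackupStrategy: each run copies everything
# changed since the last FULL backup, so restore needs only two sets.
class DifferentialBackupStrategy
  def initialize(source, destination, full_backup_time)
    @source = source
    @destination = destination
    @full_backup_time = full_backup_time # epoch time of the last full backup
  end

  def backup
    backup_path = "#{@destination}/diff_#{Time.now.to_i}"
    changed = Dir.glob("#{@source}/**/*").select do |f|
      File.file?(f) && File.mtime(f).to_i > @full_backup_time
    end
    changed.each do |f|
      target = File.join(backup_path, f.sub(@source, ''))
      FileUtils.mkdir_p(File.dirname(target))
      FileUtils.cp(f, target)
    end
    backup_path
  end

  def restore(full_backup, latest_differential, restore_location)
    # Only two sets needed: the full backup, then the newest differential.
    FileUtils.cp_r(full_backup, restore_location)
    FileUtils.cp_r("#{latest_differential}/.", restore_location)
  end
end
```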
Continuous data protection (CDP) captures every change as it occurs, providing the finest possible RPO. CDP systems maintain journals of all writes and can restore to any point in time. This approach requires significant storage and processing but provides maximum data protection.
Database replication, filesystem journaling, and application-level change tracking implement CDP. The system maintains a base snapshot plus a log of all changes, allowing reconstruction of state at any moment.
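An in-memory sketch of that model (class and method names hypothetical): a base snapshot plus a write journal that can be replayed up to any timestamp:

```ruby
# Hypothetical CDP sketch: a base snapshot plus a journal of timestamped
# writes, replayed to reconstruct state at any point in time.
class ChangeJournal
  def initialize(base_snapshot)
    @base = base_snapshot.dup
    @journal = [] # entries of [timestamp, key, value]
  end

  def write(timestamp, key, value)
    @journal << [timestamp, key, value]
  end

  def state_at(timestamp)
    state = @base.dup
    @journal.each do |ts, key, value|
      state[key] = value if ts <= timestamp
    end
    state
  end
end
```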
Snapshot-based backups capture filesystem or storage state at a specific instant without copying all data immediately. Copy-on-write snapshots preserve the original data blocks when changes occur, allowing the snapshot to represent state at snapshot time.
Storage systems like ZFS, Btrfs, and LVM provide snapshot capabilities. Application-consistent snapshots coordinate with applications to flush buffers and pause writes during snapshot creation.
# Snapshot-based backup using LVM (assumes a volume group named "vg")
require 'fileutils'

class LVMSnapshotBackup
  def create_snapshot(volume, snapshot_name)
    size = '10G' # Copy-on-write storage allocated to the snapshot
    system('lvcreate', '-L', size, '-s', '-n', snapshot_name, volume)
    mount_point = "/mnt/snapshots/#{snapshot_name}"
    FileUtils.mkdir_p(mount_point)
    system('mount', "/dev/vg/#{snapshot_name}", mount_point)
    mount_point
  end

  def backup_snapshot(mount_point, destination)
    timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
    archive = "#{destination}/snapshot_#{timestamp}.tar.gz"
    system('tar', 'czf', archive, '-C', mount_point, '.')
    archive
  end

  def cleanup_snapshot(snapshot_name, mount_point)
    system('umount', mount_point)
    system('lvremove', '-f', "/dev/vg/#{snapshot_name}")
    FileUtils.rm_rf(mount_point)
  end
end
Cloud-based backup strategies store data in cloud object storage like S3, providing durability, geographic distribution, and scalability. Cloud backups offer offsite storage without managing physical infrastructure. Multi-region replication protects against regional failures.
Cloud strategies often use lifecycle policies to transition older backups to cheaper storage tiers automatically. Recent backups stay in hot storage for fast recovery while older backups move to cold storage for cost efficiency.
Replication-based strategies maintain synchronized copies of data across multiple systems. Synchronous replication confirms writes to all replicas before acknowledging the write, providing zero data loss but higher latency. Asynchronous replication acknowledges writes immediately and replicates in the background, reducing latency but allowing potential data loss during failures.
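A toy sketch of the two acknowledgement models (classes hypothetical; plain arrays stand in for remote replicas):

```ruby
# Synchronous: the write lands on every replica before it is acknowledged.
class SyncReplicator
  def initialize(replicas)
    @replicas = replicas
  end

  def write(record)
    @replicas.each { |r| r << record }
    :acknowledged
  end
end

# Asynchronous: acknowledge immediately, replicate in the background.
# Records still in @pending are lost if the primary fails before a flush.
class AsyncReplicator
  def initialize(replicas)
    @replicas = replicas
    @pending = []
  end

  def write(record)
    @pending << record
    :acknowledged
  end

  def flush
    @pending.each { |rec| @replicas.each { |r| r << rec } }
    @pending.clear
  end
end
```

The asynchronous version returns faster but carries a window of potential data loss; the synchronous version pays replica latency on every write.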
Ruby Implementation
Ruby applications implement backups through filesystem operations, external command execution, database-specific tools, and cloud service APIs. The implementation handles file copying, compression, encryption, and integrity verification.
Basic file backup implementation:
require 'fileutils'
require 'digest'
require 'json'
require 'time'

class FileBackup
  def initialize(source_dir, backup_dir)
    @source_dir = source_dir
    @backup_dir = backup_dir
    FileUtils.mkdir_p(@backup_dir)
  end

  def create_backup(backup_name = nil)
    backup_name ||= "backup_#{Time.now.strftime('%Y%m%d_%H%M%S')}"
    backup_path = File.join(@backup_dir, backup_name)
    # Create compressed archive
    tar_file = "#{backup_path}.tar.gz"
    Dir.chdir(File.dirname(@source_dir)) do
      source_name = File.basename(@source_dir)
      system('tar', 'czf', tar_file, source_name)
    end
    # Generate checksum
    checksum = calculate_checksum(tar_file)
    # Create metadata
    metadata = {
      backup_name: backup_name,
      timestamp: Time.now.iso8601,
      source: @source_dir,
      size: File.size(tar_file),
      checksum: checksum
    }
    File.write("#{tar_file}.json", JSON.pretty_generate(metadata))
    metadata
  end

  def restore_backup(backup_name, restore_dir)
    tar_file = File.join(@backup_dir, "#{backup_name}.tar.gz")
    metadata_file = "#{tar_file}.json"
    # Verify checksum
    unless verify_checksum(tar_file, metadata_file)
      raise "Backup checksum verification failed"
    end
    # Extract archive
    FileUtils.mkdir_p(restore_dir)
    system('tar', 'xzf', tar_file, '-C', restore_dir)
    true
  end

  def list_backups
    Dir.glob(File.join(@backup_dir, '*.tar.gz.json')).map do |metadata_file|
      JSON.parse(File.read(metadata_file))
    end.sort_by { |m| m['timestamp'] }.reverse
  end

  private

  def calculate_checksum(file)
    Digest::SHA256.file(file).hexdigest
  end

  def verify_checksum(tar_file, metadata_file)
    metadata = JSON.parse(File.read(metadata_file))
    expected = metadata['checksum']
    actual = calculate_checksum(tar_file)
    expected == actual
  end
end
Database backup implementation uses database-specific tools and Ruby's command execution:
require 'fileutils'
require 'zlib'
require 'digest'
require 'json'

class PostgreSQLBackup
  def initialize(config)
    @host = config[:host]
    @database = config[:database]
    @username = config[:username]
    @backup_dir = config[:backup_dir]
    FileUtils.mkdir_p(@backup_dir)
  end

  def create_backup
    timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
    backup_file = File.join(@backup_dir, "#{@database}_#{timestamp}.sql")
    compressed_file = "#{backup_file}.gz"
    # Create database dump
    env = { 'PGPASSWORD' => ENV['PGPASSWORD'] }
    cmd = [
      'pg_dump',
      '-h', @host,
      '-U', @username,
      '-F', 'p', # Plain SQL format
      '-f', backup_file,
      @database
    ]
    unless system(env, *cmd)
      raise "Database backup failed"
    end
    # Compress backup
    compress_file(backup_file, compressed_file)
    File.delete(backup_file)
    # Generate metadata
    metadata = {
      database: @database,
      timestamp: timestamp,
      size: File.size(compressed_file),
      checksum: Digest::SHA256.file(compressed_file).hexdigest
    }
    File.write("#{compressed_file}.json", JSON.pretty_generate(metadata))
    compressed_file
  end

  def restore_backup(backup_file, target_database = nil)
    target_database ||= @database
    # Decompress backup
    sql_file = backup_file.sub(/\.gz$/, '')
    decompress_file(backup_file, sql_file)
    # Restore database
    env = { 'PGPASSWORD' => ENV['PGPASSWORD'] }
    cmd = [
      'psql',
      '-h', @host,
      '-U', @username,
      '-d', target_database,
      '-f', sql_file
    ]
    success = system(env, *cmd)
    File.delete(sql_file)
    raise "Database restore failed" unless success
    true
  end

  private

  def compress_file(input, output)
    Zlib::GzipWriter.open(output) do |gz|
      File.open(input, 'rb') do |file|
        while (chunk = file.read(1024 * 1024))
          gz.write(chunk)
        end
      end
    end
  end

  def decompress_file(input, output)
    Zlib::GzipReader.open(input) do |gz|
      File.open(output, 'wb') do |file|
        while (chunk = gz.read(1024 * 1024))
          file.write(chunk)
        end
      end
    end
  end
end
AWS S3 backup implementation using the AWS SDK:
require 'aws-sdk-s3'
require 'fileutils'

class S3Backup
  def initialize(bucket_name, region = 'us-east-1')
    @bucket_name = bucket_name
    @s3 = Aws::S3::Client.new(region: region)
  end

  def upload_backup(local_file, s3_key = nil)
    s3_key ||= File.basename(local_file)
    # Single-request upload; for very large files, Aws::S3::Object#upload_file
    # handles multipart uploads automatically
    File.open(local_file, 'rb') do |file|
      @s3.put_object(
        bucket: @bucket_name,
        key: s3_key,
        body: file,
        server_side_encryption: 'AES256',
        storage_class: 'STANDARD_IA' # Infrequent access
      )
    end
    # Verify upload
    head = @s3.head_object(bucket: @bucket_name, key: s3_key)
    {
      key: s3_key,
      size: head.content_length,
      etag: head.etag,
      last_modified: head.last_modified
    }
  end

  def download_backup(s3_key, local_file)
    FileUtils.mkdir_p(File.dirname(local_file))
    File.open(local_file, 'wb') do |file|
      # get_object streams the body in chunks when called with a block
      @s3.get_object(bucket: @bucket_name, key: s3_key) do |chunk|
        file.write(chunk)
      end
    end
    local_file
  end

  def list_backups(prefix = '')
    response = @s3.list_objects_v2(bucket: @bucket_name, prefix: prefix)
    response.contents.map do |obj|
      {
        key: obj.key,
        size: obj.size,
        last_modified: obj.last_modified,
        storage_class: obj.storage_class
      }
    end
  end

  def apply_lifecycle_policy
    @s3.put_bucket_lifecycle_configuration(
      bucket: @bucket_name,
      lifecycle_configuration: {
        rules: [
          {
            id: 'archive-old-backups',
            status: 'Enabled',
            transitions: [
              { days: 30, storage_class: 'GLACIER' },
              { days: 90, storage_class: 'DEEP_ARCHIVE' }
            ],
            expiration: { days: 2555 } # ~7 years
          }
        ]
      }
    )
  end
end
Automated backup scheduler runs backups on defined intervals:
require 'rufus-scheduler'

class BackupScheduler
  def initialize(backup_strategy)
    @backup_strategy = backup_strategy
    @scheduler = Rufus::Scheduler.new
  end

  def schedule_daily(time = '02:00')
    hour, minute = time.split(':')
    @scheduler.cron "#{minute.to_i} #{hour.to_i} * * *" do
      perform_backup('daily')
    end
  end

  def schedule_weekly(day = 'sunday', time = '03:00')
    hour, minute = time.split(':')
    @scheduler.cron "#{minute.to_i} #{hour.to_i} * * #{day[0, 3]}" do
      perform_backup('weekly')
    end
  end

  def schedule_monthly(day = 1, time = '04:00')
    hour, minute = time.split(':')
    @scheduler.cron "#{minute.to_i} #{hour.to_i} #{day} * *" do
      perform_backup('monthly')
    end
  end

  def start
    @scheduler.join
  end

  private

  def perform_backup(frequency)
    start_time = Time.now
    begin
      result = @backup_strategy.create_backup
      duration = Time.now - start_time
      log_backup_success(frequency, result, duration)
      notify_success(frequency, result) # alerting hooks not shown
    rescue => e
      log_backup_failure(frequency, e)
      notify_failure(frequency, e)
    end
  end

  def log_backup_success(frequency, result, duration)
    puts "[#{Time.now}] #{frequency} backup completed: #{result[:size]} bytes in #{duration}s"
  end

  def log_backup_failure(frequency, error)
    puts "[#{Time.now}] #{frequency} backup failed: #{error.message}"
  end
end
Tools & Ecosystem
Backup software provides functionality beyond basic file copying. These tools handle scheduling, retention, compression, encryption, deduplication, and verification.
Restic creates encrypted, deduplicated backups to local storage or cloud providers. Restic stores data in content-addressable format, where identical chunks are stored once regardless of which files contain them. This reduces storage requirements dramatically for systems with repeated data.
require 'open3'
require 'json'

class ResticBackup
  def initialize(repository, password)
    @repository = repository
    ENV['RESTIC_PASSWORD'] = password
  end

  def backup(paths, tags = [])
    tag_args = tags.flat_map { |tag| ['--tag', tag] }
    cmd = ['restic', '-r', @repository, 'backup', *tag_args, *paths]
    output, status = Open3.capture2e(*cmd)
    unless status.success?
      raise "Restic backup failed: #{output}"
    end
    parse_restic_output(output) # summary parsing helper not shown
  end

  def restore(snapshot_id, target)
    system('restic', '-r', @repository, 'restore', snapshot_id,
           '--target', target)
  end

  def list_snapshots
    output, = Open3.capture2('restic', '-r', @repository, 'snapshots', '--json')
    JSON.parse(output)
  end

  def forget_old_backups
    system('restic', '-r', @repository, 'forget',
           '--keep-daily', '30',
           '--keep-weekly', '24',
           '--keep-monthly', '84',
           '--prune')
  end
end
Borg Backup provides deduplicated, compressed, and authenticated backups. Borg excels at backing up to remote servers via SSH and provides efficient incremental backups through content-defined chunking.
Ruby gems for backups:
The backup gem provides a DSL for defining backup strategies. It supports multiple storage backends, databases, and notification methods.
# Using the backup gem
require 'backup'

Backup::Model.new(:database_backup, 'Production Database') do
  database PostgreSQL do |db|
    db.name = 'production_db'
    db.username = 'backup_user'
    db.password = ENV['DB_PASSWORD']
    db.host = 'localhost'
  end

  store_with S3 do |s3|
    s3.access_key_id = ENV['AWS_ACCESS_KEY']
    s3.secret_access_key = ENV['AWS_SECRET_KEY']
    s3.bucket = 'app-backups'
    s3.region = 'us-east-1'
    s3.path = 'database'
    s3.keep = 30
  end

  compress_with Gzip

  notify_by Mail do |mail|
    mail.on_success = false
    mail.on_failure = true
    mail.from = 'backups@example.com'
    mail.to = 'ops@example.com'
  end
end
The aws-sdk-s3 gem provides full S3 API access for cloud backups. The google-cloud-storage gem offers similar functionality for Google Cloud Storage.
Database-specific tools:
PostgreSQL uses pg_dump for logical backups and pg_basebackup for physical backups. MySQL uses mysqldump for logical backups and supports binary log replication. MongoDB uses mongodump and mongorestore.
Monitoring and alerting tools track backup success and send notifications on failure. Services like PagerDuty, Datadog, and custom monitoring scripts ensure backup operations complete successfully.
# Backup monitoring with health checks (healthchecks.io-style endpoints)
require 'http' # http.rb gem

class BackupMonitor
  def initialize(health_check_url)
    @health_check_url = health_check_url
  end

  def ping_start
    HTTP.get("#{@health_check_url}/start")
  end

  def ping_success
    HTTP.get(@health_check_url)
  end

  def ping_failure(error)
    HTTP.post(@health_check_url, json: { status: 'failure', error: error.message })
  end
end
Security Implications
Encryption at rest protects backup data from unauthorized access when stored. Backups often contain sensitive customer data, credentials, and proprietary information that require protection equal to or greater than production data.
Encrypt backups using strong algorithms like AES-256 before transmitting to storage. The encryption key must be protected separately from the backup data. Key management services like AWS KMS or HashiCorp Vault store encryption keys securely.
require 'openssl'

class EncryptedBackup
  def initialize(encryption_key)
    @encryption_key = encryption_key
  end

  def encrypt_file(input_file, output_file)
    cipher = OpenSSL::Cipher.new('AES-256-CBC')
    cipher.encrypt
    cipher.key = derive_key(@encryption_key)
    iv = cipher.random_iv
    File.open(output_file, 'wb') do |out|
      # Write IV first (not secret)
      out.write(iv)
      File.open(input_file, 'rb') do |input|
        while (chunk = input.read(1024 * 1024))
          out.write(cipher.update(chunk))
        end
        out.write(cipher.final)
      end
    end
  end

  def decrypt_file(input_file, output_file)
    decipher = OpenSSL::Cipher.new('AES-256-CBC')
    decipher.decrypt
    decipher.key = derive_key(@encryption_key)
    File.open(output_file, 'wb') do |out|
      File.open(input_file, 'rb') do |input|
        # Read IV from file
        decipher.iv = input.read(16)
        while (chunk = input.read(1024 * 1024))
          out.write(decipher.update(chunk))
        end
        out.write(decipher.final)
      end
    end
  end

  private

  def derive_key(password)
    OpenSSL::PKCS5.pbkdf2_hmac(
      password,
      'backup-salt', # should be a unique, random salt stored with the backup
      10_000,
      32,
      OpenSSL::Digest.new('SHA256') # digest argument is required
    )
  end
end
Encryption in transit protects data while transferring to backup storage. Use TLS for network transfers and verify certificates to prevent man-in-the-middle attacks. Cloud storage APIs typically provide TLS by default, but verify configuration.
Access control limits who can create, access, and delete backups. Principle of least privilege applies: backup systems need read access to production data but humans should not access backups without specific authorization.
Implement role-based access control (RBAC) for backup systems. Separate credentials for backup creation, backup restoration, and backup deletion. An attacker compromising production systems should not automatically gain access to backups.
Backup immutability prevents modification or deletion of backups for a specified period. Immutable backups protect against ransomware that tries to encrypt or delete backups before demanding ransom. S3 Object Lock and similar features prevent deletion even with full administrative credentials.
# S3 backup with Object Lock for immutability
# (Object Lock must be enabled on the bucket, normally at creation time)
require 'aws-sdk-s3'

class ImmutableBackup
  def initialize(bucket_name)
    @bucket_name = bucket_name
    @s3 = Aws::S3::Client.new
  end

  def enable_object_lock
    @s3.put_object_lock_configuration(
      bucket: @bucket_name,
      object_lock_configuration: {
        object_lock_enabled: 'Enabled',
        rule: {
          default_retention: {
            mode: 'GOVERNANCE', # or 'COMPLIANCE'
            days: 30
          }
        }
      }
    )
  end

  def upload_immutable_backup(file, key)
    File.open(file, 'rb') do |f|
      @s3.put_object(
        bucket: @bucket_name,
        key: key,
        body: f,
        object_lock_mode: 'GOVERNANCE',
        object_lock_retain_until_date: Time.now + (30 * 86400)
      )
    end
  end
end
Credential management requires careful handling. Backup systems need credentials for databases, storage systems, and cloud services. Never hardcode credentials in backup scripts. Use environment variables, secrets management systems, or instance profiles.
Audit logging tracks backup operations for security monitoring and compliance. Log backup creation, restoration attempts, access to backup data, and backup deletions. Include timestamp, user, operation type, and result in audit logs.
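A sketch of such a log as JSON lines (the class name is hypothetical):

```ruby
require 'json'
require 'time'
require 'stringio'

# Records each backup operation with timestamp, user, operation, and result.
class BackupAuditLog
  def initialize(io = $stdout)
    @io = io
  end

  def record(user:, operation:, result:)
    entry = {
      timestamp: Time.now.utc.iso8601,
      user: user,
      operation: operation,
      result: result
    }
    @io.puts(JSON.generate(entry))
    entry
  end
end

log = BackupAuditLog.new(StringIO.new)
log.record(user: 'backup-svc', operation: 'create', result: 'success')
```

In production the IO target would be an append-only log shipped to a separate system, so an attacker who compromises the backup host cannot quietly rewrite the audit trail.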
Data retention and disposal must comply with regulations. Some regulations require specific retention periods, others mandate secure deletion after retention expires. Implement automated expiration policies and cryptographic erasure (destroying encryption keys) for secure deletion.
Common Pitfalls
Untested backups represent the most critical failure. Organizations discover backup problems during recovery attempts when data loss has already occurred. Regular restore testing verifies backups are valid and recovery procedures work.
Schedule quarterly or monthly restore tests to random servers or test environments. Measure actual RTO during tests and compare against objectives. Document any discrepancies and update procedures.
require 'fileutils'
require 'json'

class BackupValidator
  def validate_backup(backup_file)
    errors = []
    # Check file exists
    unless File.exist?(backup_file)
      errors << "Backup file not found"
      return errors
    end
    # Check file size
    if File.size(backup_file) == 0
      errors << "Backup file is empty"
    end
    # Verify checksum
    metadata_file = "#{backup_file}.json"
    if File.exist?(metadata_file)
      unless verify_integrity(backup_file, metadata_file)
        errors << "Checksum verification failed"
      end
    else
      errors << "Metadata file missing"
    end
    # Test restoration to a temporary location
    # (verify_integrity, restore_backup, verify_restore helpers not shown)
    temp_dir = "/tmp/backup_test_#{Time.now.to_i}"
    begin
      restore_backup(backup_file, temp_dir)
      errors << "Restore test failed" unless verify_restore(temp_dir)
    rescue => e
      errors << "Restore exception: #{e.message}"
    ensure
      FileUtils.rm_rf(temp_dir) if Dir.exist?(temp_dir)
    end
    errors
  end
end
Insufficient backup frequency leads to excessive data loss. A database with hourly changes backed up daily has up to 24 hours of data loss risk. Measure actual change rates and set backup frequency accordingly.
Storing backups on the same storage as the source fails to protect against storage failures. A backup on the same disk as production data provides no protection when that disk fails. The 3-2-1 rule addresses this with separate media types and an offsite copy.
Ignoring backup monitoring allows backup failures to continue undetected. Silent failures accumulate until recovery is attempted and no valid backups exist. Implement active monitoring with alerts for backup failures.
Inadequate retention periods delete backups before they might be needed. Corruption discovered weeks after it occurred requires backups older than daily retention. Balance storage costs against recovery scenarios including delayed corruption detection.
Backup corruption during creation produces invalid backups that appear successful. Network interruptions, disk errors, or software bugs corrupt backup data. Verify backups immediately after creation using checksums and test restorations.
Forgetting incremental backup dependencies leads to incomplete recovery. Deleting an incremental backup in a chain breaks recovery for all subsequent backups. Retention policies must consider backup dependencies.
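A minimal dependency check before deletion (the data shape is hypothetical): a backup is safe to remove only when nothing newer depends on it:

```ruby
# Each backup records the backup it depends on (:parent is nil for fulls).
def safe_to_delete?(backup, all_backups)
  all_backups.none? { |b| b[:parent] == backup[:id] }
end

chain = [
  { id: 'full_1', parent: nil },
  { id: 'incr_1', parent: 'full_1' },
  { id: 'incr_2', parent: 'incr_1' }
]
safe_to_delete?(chain[0], chain) # => false, incr_1 depends on it
safe_to_delete?(chain[2], chain) # => true, end of the chain
```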
Performance impact on production occurs when backups consume excessive CPU, memory, or I/O. Schedule intensive backup operations during low-traffic periods or use techniques like snapshots that minimize impact.
Incomplete application consistency produces backups with inconsistent state. Backing up a database while writes occur can capture data mid-transaction. Use application-specific backup tools or quiesce writes during backup.
Single point of failure in backup infrastructure eliminates protection when backup servers fail. A single backup server processing all backups becomes a bottleneck and single point of failure. Distribute backup operations and maintain redundancy.
Encryption key loss makes encrypted backups unrecoverable. Organizations implementing encryption without proper key management permanently lose access to backups when keys are lost. Store encryption keys separately from backups with appropriate redundancy.
Reference
Backup Strategy Comparison
| Strategy | Storage Required | Backup Speed | Recovery Speed | Recovery Complexity |
|---|---|---|---|---|
| Full | Highest | Slowest | Fastest | Simple |
| Incremental | Lowest | Fastest | Slowest | Complex |
| Differential | Medium | Medium | Medium | Moderate |
| Snapshot | Low | Very Fast | Fast | Simple |
| Continuous | High | Continuous | Very Fast | Moderate |
RTO and RPO Guidelines
| Criticality Level | Typical RPO | Typical RTO | Backup Frequency | Storage Type |
|---|---|---|---|---|
| Critical (Tier 1) | Minutes | Minutes | Continuous/Hourly | Replicated |
| High (Tier 2) | Hours | Hours | Hourly | Hot storage |
| Medium (Tier 3) | 24 hours | 8 hours | Daily | Warm storage |
| Low (Tier 4) | Days | 24+ hours | Weekly | Cold storage |
Retention Policy Examples
| Backup Type | Frequency | Keep Duration | Use Case |
|---|---|---|---|
| Transaction logs | Continuous | 7 days | Point-in-time recovery |
| Daily full | Daily | 30 days | Recent recovery |
| Weekly full | Weekly | 12 weeks | Medium-term recovery |
| Monthly full | Monthly | 7 years | Compliance, long-term |
| Yearly archive | Yearly | Indefinite | Historical reference |
Backup Tools and Gems
| Tool/Gem | Type | Primary Use | Key Feature |
|---|---|---|---|
| restic | System | File backup | Deduplication, encryption |
| borg | System | File backup | Compression, SSH support |
| pg_dump | Database | PostgreSQL | Logical backup |
| mysqldump | Database | MySQL | Logical backup |
| mongodump | Database | MongoDB | BSON export |
| backup gem | Ruby | Orchestration | Multi-backend DSL |
| aws-sdk-s3 | Ruby | Cloud storage | S3 integration |
| rufus-scheduler | Ruby | Scheduling | Cron-like scheduling |
Backup Verification Checklist
| Check | Method | Frequency | Critical |
|---|---|---|---|
| File integrity | Checksum validation | Every backup | Yes |
| Backup completion | Job status monitoring | Every backup | Yes |
| Storage availability | Health checks | Daily | Yes |
| Restore testing | Full restore to test env | Monthly | Yes |
| Performance metrics | Duration and size tracking | Every backup | No |
| Encryption validation | Key accessibility test | Weekly | Yes |
| Retention compliance | Age-based verification | Weekly | Yes |
| Documentation review | Procedure validation | Quarterly | No |
S3 Storage Classes
| Storage Class | Retrieval Time | Cost | Use Case |
|---|---|---|---|
| STANDARD | Immediate | Highest | Recent backups |
| STANDARD_IA | Immediate | Medium | 30-day backups |
| GLACIER | Minutes-hours | Low | Long-term backups |
| DEEP_ARCHIVE | 12 hours | Lowest | Compliance archives |
Recovery Process Steps
| Phase | Actions | Validation |
|---|---|---|
| Assessment | Identify scope of loss, determine recovery point | Confirm what needs restoration |
| Preparation | Locate backup, verify integrity, provision resources | Checksum validation, space check |
| Restoration | Execute restore procedure, monitor progress | Progress monitoring |
| Verification | Compare restored data, test functionality | Data validation, application tests |
| Production | Switch to restored system, monitor closely | Performance monitoring |
Encryption Standards
| Algorithm | Key Size | Use Case | Performance |
|---|---|---|---|
| AES-256-CBC | 256 bit | File encryption | Fast |
| AES-256-GCM | 256 bit | File encryption with auth | Fast |
| ChaCha20-Poly1305 | 256 bit | Stream encryption | Very fast |
| RSA-4096 | 4096 bit | Key encryption | Slow |
Backup Command Examples
# PostgreSQL backup
pg_dump -h localhost -U username -F c -f backup.dump database_name
# PostgreSQL restore
pg_restore -h localhost -U username -d database_name backup.dump
# MySQL backup
mysqldump -u username -p database_name > backup.sql
# MySQL restore
mysql -u username -p database_name < backup.sql
# Restic initialize
restic init -r /backup/repo
# Restic backup
restic -r /backup/repo backup /data
# Restic restore
restic -r /backup/repo restore latest --target /restore
# S3 sync
aws s3 sync /local/path s3://bucket/path --sse AES256