CrackedRuby

Data Compression Techniques

Overview

Data compression reduces the size of data by encoding information more efficiently. The process transforms data from its original representation into a form that requires fewer bits while preserving the ability to reconstruct the original data, either exactly or approximately. Compression algorithms operate on the principle that most data contains redundancy or patterns that can be represented more compactly.

Two fundamental categories define compression approaches: lossless compression preserves all original data and allows perfect reconstruction, while lossy compression discards less important information to achieve higher compression ratios. Lossless compression applies to scenarios requiring exact data recovery, such as text files, executables, and archives. Lossy compression suits applications where approximate reconstruction suffices, such as images, audio, and video.

Compression effectiveness depends on the entropy of the source data. Entropy measures the average information content per symbol in a message. Data with high redundancy contains low entropy and compresses well, while random or already-compressed data contains high entropy and resists further compression. This relationship defines theoretical compression limits described by Shannon's source coding theorem.
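This relationship can be checked directly with a byte-frequency entropy estimate (a sketch; `shannon_entropy` is a helper name introduced here): redundant data scores far below random data.

```ruby
# Estimate Shannon entropy in bits per byte from byte frequencies
def shannon_entropy(data)
  counts = Hash.new(0)
  data.each_byte { |b| counts[b] += 1 }
  counts.values.sum do |count|
    p = count.to_f / data.bytesize
    -p * Math.log2(p)
  end
end

redundant = "aaaabbbb" * 100                       # two symbols, equal frequency
random    = (0...800).map { rand(256) }.pack('C*') # close to uniform bytes

puts format("Redundant: %.2f bits/byte", shannon_entropy(redundant))  # 1.00
puts format("Random:    %.2f bits/byte", shannon_entropy(random))     # near 8
```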

# Basic compression demonstration
require 'zlib'

original = "aaaaaaabbbbbbcccccc"
compressed = Zlib::Deflate.deflate(original)

puts "Original size: #{original.bytesize} bytes"
puts "Compressed size: #{compressed.bytesize} bytes"
puts "Compression ratio: #{((1 - compressed.bytesize.to_f / original.bytesize) * 100).round(2)}%"
# Original size: 19 bytes
# Compressed size: 15 bytes
# Compression ratio: 21.05%

The compression pipeline typically consists of multiple stages: analysis of input data characteristics, application of transformation to expose redundancy, encoding of transformed data using efficient representation, and optional post-processing for storage or transmission. Each stage contributes to overall compression effectiveness.

Hardware and software implementations differ significantly in performance characteristics. Hardware implementations achieve high throughput for specific algorithms through dedicated circuits, while software implementations provide flexibility across diverse platforms and data types. Modern processors include specialized instructions for compression operations, such as CRC calculation and bit manipulation, improving software compression performance.

Key Principles

Redundancy elimination forms the foundation of compression. Redundancy manifests in multiple forms: statistical redundancy occurs when symbols appear with non-uniform probability, spatial redundancy exists when neighboring data elements correlate, and temporal redundancy appears when data patterns repeat over time. Compression algorithms exploit these redundancies through different mechanisms.

Statistical encoding assigns shorter codes to frequently occurring symbols and longer codes to rare symbols. Huffman coding generates optimal prefix-free codes based on symbol frequencies, where no code serves as a prefix for another code. Arithmetic coding achieves better compression for sources with skewed probability distributions by encoding entire messages as single numbers in the interval [0, 1).

# Statistical encoding example using frequency analysis
def analyze_frequencies(data)
  frequencies = Hash.new(0)
  data.each_char { |char| frequencies[char] += 1 }
  # Sort by descending count, breaking ties by character for a deterministic result
  frequencies.sort_by { |char, count| [-count, char] }
end

text = "compression uses frequency analysis"
frequencies = analyze_frequencies(text)

puts "Character frequencies (most common first):"
frequencies.take(5).each do |char, count|
  puts "  '#{char}': #{count} times"
end
# Character frequencies (most common first):
#   's': 6 times
#   'e': 4 times
#   ' ': 3 times
#   'n': 3 times
#   'a': 2 times

Dictionary-based compression replaces repeated sequences with references to a dictionary containing previously seen data. LZ77 maintains a sliding window of recent data and encodes matches as (distance, length) pairs. LZ78 builds an explicit dictionary of sequences encountered during compression, adding new entries as unique patterns appear. LZW extends LZ78 by starting with a predefined dictionary of single symbols and dynamically expanding it during compression.
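The LZ77 mechanism can be sketched in a few lines: emit (distance, length, next-literal) triples against a sliding window (an educational sketch with names introduced here, not the DEFLATE implementation; it indexes characters rather than bytes and searches the window naively):

```ruby
# Minimal LZ77-style encoder: emits [distance, length, next_char] triples
def lz77_encode(data, window = 32)
  tokens = []
  pos = 0
  while pos < data.length
    start = [0, pos - window].max
    best_len = 0
    best_dist = 0
    (start...pos).each do |i|
      len = 0
      # Extend the match; overlap into the lookahead is allowed
      len += 1 while pos + len < data.length && data[i + len] == data[pos + len] && len < 255
      if len > best_len
        best_len = len
        best_dist = pos - i
      end
    end
    next_char = data[pos + best_len]  # literal following the match (nil at end)
    tokens << [best_dist, best_len, next_char]
    pos += best_len + 1
  end
  tokens
end

def lz77_decode(tokens)
  out = ''
  tokens.each do |dist, len, ch|
    len.times { out << out[out.length - dist] }  # copy byte by byte from the window
    out << ch if ch
  end
  out
end

text = 'abcabcabcabcx'
tokens = lz77_encode(text)
puts tokens.inspect
# [[0, 0, "a"], [0, 0, "b"], [0, 0, "c"], [3, 9, "x"]]
puts lz77_decode(tokens) == text
# true
```

Note how the single `[3, 9, "x"]` token reproduces nine characters from only three previously seen ones: because decoding copies one character at a time, a match may overlap its own output, which is how LZ77 encodes runs compactly.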

Transform coding reorganizes data to concentrate information in fewer coefficients. The Discrete Cosine Transform (DCT) converts spatial domain data into frequency domain representation, where most energy concentrates in low-frequency coefficients. Quantization then discards high-frequency coefficients with minimal perceptual impact, achieving compression through controlled information loss.
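The concentration effect can be observed with a naive one-dimensional DCT-II (a sketch; `dct` is a helper defined here, and a production codec would use a fast transform):

```ruby
# Naive 1-D DCT-II (unnormalized): converts samples to frequency coefficients
def dct(signal)
  n = signal.length
  (0...n).map do |k|
    (0...n).sum { |i| signal[i] * Math.cos(Math::PI / n * (i + 0.5) * k) }
  end
end

# A smooth ramp signal: after the transform, energy piles into low frequencies
signal = (0...8).map(&:to_f)
coeffs = dct(signal)

total = coeffs.sum { |c| c**2 }
low   = coeffs.take(2).sum { |c| c**2 }
puts format("Energy in first 2 of 8 coefficients: %.1f%%", 100 * low / total)
# prints a value above 95% -- quantizing the remaining small
# high-frequency coefficients to zero would lose little information
```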

Run-length encoding handles sequences of repeated values by storing the value and repetition count. This simple technique works effectively for data with long runs of identical values, such as binary images or sparse arrays. The compression ratio depends directly on the average run length in the input data.

# Run-length encoding implementation
def run_length_encode(data)
  return [] if data.empty?
  
  encoded = []
  current_value = data[0]
  count = 1
  
  data[1..-1].each do |value|
    if value == current_value
      count += 1
    else
      encoded << [current_value, count]
      current_value = value
      count = 1
    end
  end
  
  encoded << [current_value, count]
  encoded
end

data = [1, 1, 1, 2, 2, 3, 3, 3, 3]
encoded = run_length_encode(data)
puts "Original: #{data.inspect}"
puts "Encoded: #{encoded.inspect}"
# Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
# Encoded: [[1, 3], [2, 2], [3, 4]]

Entropy encoding converts symbols into bit sequences based on probability distributions. Optimal codes satisfy the Kraft inequality, ensuring decodability. Symbol probabilities determine the minimum average code length through the entropy formula: H(X) = -Σ p(x) log₂ p(x). The achievable average code length approaches but never falls below this theoretical limit for stationary sources.
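The Kraft inequality itself is easy to verify for a proposed set of binary code lengths (a sketch; `kraft_sum` is a name introduced here):

```ruby
# Kraft inequality: a binary prefix code with lengths l_i exists
# if and only if sum(2**-l_i) <= 1
def kraft_sum(lengths)
  lengths.sum { |l| 2.0**-l }
end

puts kraft_sum([1, 2, 3, 3])  # => 1.0  (complete code, e.g. 0, 10, 110, 111)
puts kraft_sum([1, 1, 2])     # => 1.25 (no prefix code has these lengths)
```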

Context modeling improves compression by adapting probability estimates based on previously encoded symbols. Higher-order models consider longer context windows, providing better predictions at the cost of increased memory and computation. Adaptive models update probabilities during encoding, eliminating the need to transmit probability tables separately.
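A minimal order-1 adaptive model can be sketched as per-context symbol counts updated online (names introduced here; a real compressor would feed these estimates to an arithmetic coder):

```ruby
# Order-1 adaptive model: estimates P(symbol | previous symbol) and
# updates its counts as each symbol is seen
class ContextModel
  ALPHABET = 256  # model raw bytes

  def initialize
    @counts = Hash.new { |h, ctx| h[ctx] = Hash.new(0) }
    @totals = Hash.new(0)
  end

  # Laplace-smoothed probability estimate
  def probability(context, symbol)
    (@counts[context][symbol] + 1.0) / (@totals[context] + ALPHABET)
  end

  # Called after each symbol is coded, keeping encoder and decoder in sync
  def update(context, symbol)
    @counts[context][symbol] += 1
    @totals[context] += 1
  end
end

model = ContextModel.new
prev = nil
("ab" * 100).each_byte do |b|
  model.update(prev, b) if prev
  prev = b
end

# After many "ab" pairs, 'b' is far more probable than 'c' following 'a'
puts model.probability('a'.ord, 'b'.ord) > 50 * model.probability('a'.ord, 'c'.ord)
# true
```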

Prediction algorithms estimate the next symbol based on preceding symbols, then encode the prediction error rather than the original value. Linear predictors compute weighted sums of previous samples, while nonlinear predictors handle more complex dependencies. Prediction reduces effective entropy when correlations exist between adjacent symbols.
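Order-1 prediction can be demonstrated with delta encoding: store each sample's difference from its predecessor, which shrinks the value range of smooth signals before entropy coding (a sketch; the helper names are introduced here):

```ruby
require 'zlib'

# Order-1 prediction: encode each sample as its difference from the previous one
def delta_encode(samples)
  prev = 0
  samples.map { |s| d = s - prev; prev = s; d }
end

def delta_decode(deltas)
  prev = 0
  deltas.map { |d| prev += d }
end

# A slowly varying signal: neighboring samples correlate strongly
samples = (0...1000).map { |i| (100 * Math.sin(i / 50.0)).round }
deltas  = delta_encode(samples)

# Deltas occupy a tiny range (here roughly -2..2), so they compress
# far better than the raw 16-bit samples
puts "Raw compressed:   #{Zlib::Deflate.deflate(samples.pack('s*')).bytesize} bytes"
puts "Delta compressed: #{Zlib::Deflate.deflate(deltas.pack('s*')).bytesize} bytes"
puts delta_decode(deltas) == samples
# true
```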

Ruby Implementation

Ruby provides compression through the Zlib library, which implements DEFLATE compression combining LZ77 and Huffman coding. The Zlib module offers both stream-based and one-shot compression interfaces. Stream-based compression processes data incrementally, suitable for large files or network streams, while one-shot methods compress entire strings in memory.

require 'zlib'

# One-shot compression
original_data = "The quick brown fox jumps over the lazy dog" * 100
compressed = Zlib::Deflate.deflate(original_data, Zlib::BEST_COMPRESSION)
decompressed = Zlib::Inflate.inflate(compressed)

puts "Original size: #{original_data.bytesize}"
puts "Compressed size: #{compressed.bytesize}"
puts "Compression ratio: #{compressed.bytesize.to_f / original_data.bytesize}"
puts "Data matches: #{original_data == decompressed}"
# Original size: 4300
# Compressed size: 54 (approximately; varies with zlib version)
# Compression ratio: ~0.0126
# Data matches: true

The Zlib::Deflate class provides control over compression levels ranging from 0 (no compression) to 9 (maximum compression). Higher compression levels increase CPU usage while producing smaller output. Level selection depends on the trade-off between compression speed and output size for specific applications.

require 'zlib'
require 'benchmark'

data = File.read('/usr/share/dict/words')

Benchmark.bm(20) do |x|
  [Zlib::NO_COMPRESSION, Zlib::BEST_SPEED, Zlib::DEFAULT_COMPRESSION, 
   Zlib::BEST_COMPRESSION].each do |level|
    x.report("Level #{level}:") do
      compressed = Zlib::Deflate.deflate(data, level)
      puts "  Size: #{compressed.bytesize} bytes"
    end
  end
end

Stream-based compression handles data too large for memory by processing chunks incrementally. The GzipWriter and GzipReader classes wrap IO objects, transparently compressing or decompressing data during write and read operations. This approach integrates naturally with file and network I/O.

require 'zlib'

# Stream-based file compression
def compress_file(input_path, output_path)
  Zlib::GzipWriter.open(output_path) do |gz|
    File.open(input_path, 'rb') do |file|
      while chunk = file.read(8192)
        gz.write(chunk)
      end
    end
  end
end

# Stream-based file decompression
def decompress_file(input_path, output_path)
  Zlib::GzipReader.open(input_path) do |gz|
    File.open(output_path, 'wb') do |file|
      while chunk = gz.read(8192)
        file.write(chunk)
      end
    end
  end
end

compress_file('large_file.txt', 'large_file.txt.gz')
decompress_file('large_file.txt.gz', 'restored_file.txt')

The StringIO class enables in-memory stream operations, useful for testing or processing string data with compression APIs designed for streams. This approach combines the flexibility of stream-based processing with the performance of in-memory operations.

require 'zlib'
require 'stringio'

# Compress string data using stream interface
input = "Repeated data " * 1000
output = StringIO.new

Zlib::GzipWriter.wrap(output) do |gz|
  gz.write(input)
end

compressed = output.string
puts "Input size: #{input.bytesize}"
puts "Compressed size: #{compressed.bytesize}"
# Input size: 14000
# Compressed size: 43

Custom compression implementations demonstrate algorithm fundamentals. A basic Huffman encoder builds a frequency table, constructs an optimal prefix tree, generates codes, and encodes the input. This implementation serves educational purposes rather than production use.

# Simplified Huffman coding implementation
class HuffmanNode
  attr_accessor :char, :frequency, :left, :right
  
  def initialize(char, frequency, left = nil, right = nil)
    @char = char
    @frequency = frequency
    @left = left
    @right = right
  end
  
  def leaf?
    @left.nil? && @right.nil?
  end
end

def build_huffman_tree(frequencies)
  nodes = frequencies.map { |char, freq| HuffmanNode.new(char, freq) }
  
  while nodes.size > 1
    nodes.sort_by!(&:frequency)
    left = nodes.shift
    right = nodes.shift
    parent = HuffmanNode.new(nil, left.frequency + right.frequency, left, right)
    nodes << parent
  end
  
  nodes.first
end

def generate_codes(node, prefix = '', codes = {})
  return codes if node.nil?
  
  if node.leaf?
    codes[node.char] = prefix
  else
    generate_codes(node.left, prefix + '0', codes)
    generate_codes(node.right, prefix + '1', codes)
  end
  
  codes
end

text = "huffman encoding example"
frequencies = Hash.new(0)
text.each_char { |char| frequencies[char] += 1 }

tree = build_huffman_tree(frequencies)
codes = generate_codes(tree)

puts "Huffman codes:"
codes.sort_by { |_, code| code.length }.each do |char, code|
  puts "  '#{char}': #{code}"
end

Practical Examples

Text file compression demonstrates common compression workflow patterns. Applications typically compress configuration files, logs, and documentation to reduce storage requirements and transfer times. The compression ratio depends on text redundancy, with natural language text typically achieving 2:1 to 3:1 compression.

require 'zlib'

# Compress and decompress text files
class TextCompressor
  def self.compress_file(input_file, output_file)
    input_text = File.read(input_file)
    compressed = Zlib::Deflate.deflate(input_text, Zlib::BEST_COMPRESSION)
    
    File.open(output_file, 'wb') do |f|
      f.write([input_text.bytesize].pack('Q'))  # Store original size
      f.write(compressed)
    end
    
    {
      original_size: input_text.bytesize,
      compressed_size: compressed.bytesize + 8,
      ratio: compressed.bytesize.to_f / input_text.bytesize
    }
  end
  
  def self.decompress_file(input_file, output_file)
    File.open(input_file, 'rb') do |f|
      original_size = f.read(8).unpack1('Q')
      compressed = f.read
      
      decompressed = Zlib::Inflate.inflate(compressed)
      
      if decompressed.bytesize != original_size
        raise "Size mismatch: expected #{original_size}, got #{decompressed.bytesize}"
      end
      
      File.write(output_file, decompressed)
    end
  end
end

# Example usage
stats = TextCompressor.compress_file('document.txt', 'document.txt.zz')
puts "Compressed: #{stats[:original_size]} → #{stats[:compressed_size]} bytes"
puts "Ratio: #{(stats[:ratio] * 100).round(2)}%"

JSON data compression applies to API responses and configuration files containing structured data. JSON text contains significant redundancy through repeated key names, whitespace, and structural patterns. Compression before transmission reduces bandwidth consumption.

require 'zlib'
require 'json'
require 'base64'

# Compress JSON API responses
class JsonCompressor
  def self.compress(data)
    json_string = JSON.generate(data)
    compressed = Zlib::Deflate.deflate(json_string)
    Base64.strict_encode64(compressed)
  end
  
  def self.decompress(compressed_string)
    compressed = Base64.strict_decode64(compressed_string)
    json_string = Zlib::Inflate.inflate(compressed)
    JSON.parse(json_string, symbolize_names: true)  # restore symbol keys for round-tripping
  end
end

# Sample data with redundancy
users = 100.times.map do |i|
  {
    user_id: i,
    username: "user#{i}",
    email: "user#{i}@example.com",
    status: "active",
    created_at: "2024-01-01T00:00:00Z"
  }
end

original_json = JSON.generate(users)
compressed = JsonCompressor.compress(users)

puts "Original JSON: #{original_json.bytesize} bytes"
puts "Compressed: #{compressed.bytesize} bytes"
puts "Ratio: #{(compressed.bytesize.to_f / original_json.bytesize * 100).round(2)}%"

# Verify round-trip
restored = JsonCompressor.decompress(compressed)
puts "Data integrity: #{users == restored}"

Binary data compression handles executable files, images, and other non-text formats. Binary data often contains less redundancy than text, resulting in lower compression ratios. Compression algorithms must handle arbitrary byte sequences without assuming text encoding.

require 'zlib'

# Binary file compression with integrity checking
class BinaryCompressor
  def self.compress_with_checksum(input_data)
    checksum = Zlib.crc32(input_data)
    compressed = Zlib::Deflate.deflate(input_data, Zlib::BEST_COMPRESSION)
    
    # Store checksum (explicit little-endian for portability) and compressed data
    [checksum].pack('L<') + compressed
  end
  
  def self.decompress_with_verification(compressed_data)
    stored_checksum = compressed_data[0, 4].unpack1('L<')
    compressed = compressed_data[4..-1]
    
    decompressed = Zlib::Inflate.inflate(compressed)
    calculated_checksum = Zlib.crc32(decompressed)
    
    if stored_checksum != calculated_checksum
      raise "Checksum mismatch: data corruption detected"
    end
    
    decompressed
  end
end

# Example with binary data
binary_data = 10000.times.map { rand(256) }.pack('C*')
compressed = BinaryCompressor.compress_with_checksum(binary_data)

puts "Original: #{binary_data.bytesize} bytes"
puts "Compressed: #{compressed.bytesize} bytes"

restored = BinaryCompressor.decompress_with_verification(compressed)
puts "Verification: #{binary_data == restored}"

Archive creation combines multiple files into a single compressed container. This pattern appears in backup systems, software distribution, and file transfer utilities. The archive format stores file metadata alongside compressed content.

require 'zlib'
require 'json'

# Simple archive format for multiple files
class SimpleArchive
  def self.create(output_file, files)
    File.open(output_file, 'wb') do |archive|
      manifest = []
      
      files.each do |input_file|
        data = File.read(input_file, mode: 'rb')
        compressed = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
        
        manifest << {
          name: File.basename(input_file),
          original_size: data.bytesize,
          compressed_size: compressed.bytesize,
          offset: archive.pos
        }
        
        archive.write(compressed)
      end
      
      # Write manifest at end
      manifest_json = JSON.generate(manifest)
      archive.write([manifest_json.bytesize].pack('Q'))
      archive.write(manifest_json)
    end
  end
  
  def self.extract(archive_file, output_dir)
    File.open(archive_file, 'rb') do |archive|
      # Read manifest from end
      archive.seek(-8, IO::SEEK_END)
      manifest_size = archive.read(8).unpack1('Q')
      archive.seek(-(8 + manifest_size), IO::SEEK_END)
      manifest = JSON.parse(archive.read(manifest_size))
      
      # Extract each file
      manifest.each do |entry|
        archive.seek(entry['offset'])
        compressed = archive.read(entry['compressed_size'])
        data = Zlib::Inflate.inflate(compressed)
        
        output_path = File.join(output_dir, entry['name'])
        File.write(output_path, data, mode: 'wb')
        
        puts "Extracted: #{entry['name']} (#{data.bytesize} bytes)"
      end
    end
  end
end

Performance Considerations

Compression speed depends on algorithm complexity and data characteristics. Dictionary-based algorithms like DEFLATE perform more work per byte than simple schemes like run-length encoding. The compression level parameter trades CPU time for output size, with higher levels requiring substantially more computation for diminishing returns.

Memory usage varies significantly between algorithms. Streaming algorithms process fixed-size windows, maintaining constant memory requirements regardless of input size. Dictionary-based methods require memory proportional to window size, typically 32KB for DEFLATE. Block-based algorithms may buffer entire blocks before processing.

Decompression generally executes faster than compression because it performs lookup operations rather than search operations. Dictionary-based decompression reads backward references and copies data, avoiding the pattern matching required during compression. This asymmetry makes compression suitable for write-once, read-many scenarios.

require 'zlib'
require 'benchmark'

# Compare compression levels
data = File.read('/usr/share/dict/words')

results = {}
Benchmark.bm(30) do |x|
  (0..9).each do |level|
    time = x.report("Compression level #{level}:") do
      compressed = Zlib::Deflate.deflate(data, level)
      results[level] = {
        size: compressed.bytesize,
        ratio: compressed.bytesize.to_f / data.bytesize
      }
    end
  end
  
  compressed = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
  x.report("Decompression:") do
    5.times { Zlib::Inflate.inflate(compressed) }
  end
end

puts "\nSize analysis:"
results.each do |level, stats|
  puts "Level #{level}: #{stats[:size]} bytes (#{(stats[:ratio] * 100).round(2)}%)"
end

Parallel compression improves throughput by processing independent data segments concurrently. Block-based formats divide input into chunks that compress independently, enabling multi-threaded compression without coordination overhead. The pigz utility, for example, produces standard GZIP output while compressing blocks in parallel.

require 'zlib'

# Parallel compression of independent blocks
class ParallelCompressor
  def self.compress(data, num_threads: 4, block_size: 1_048_576)
    blocks = []
    offset = 0
    
    while offset < data.bytesize
      blocks << data.byteslice(offset, block_size)
      offset += block_size
    end
    
    compressed_blocks = Array.new(blocks.size)
    queue = Queue.new
    blocks.each_with_index { |block, idx| queue << [block, idx] }
    
    threads = num_threads.times.map do
      Thread.new do
        loop do
          begin
            block, idx = queue.pop(true)  # non-blocking; raises ThreadError when empty
          rescue ThreadError
            break
          end
          
          compressed_blocks[idx] = {
            data: Zlib::Deflate.deflate(block, Zlib::BEST_SPEED),
            original_size: block.bytesize
          }
        end
      end
    end
    
    threads.each(&:join)
    compressed_blocks
  end
end

# Example usage
large_data = "Sample text block " * 100000
result = ParallelCompressor.compress(large_data, num_threads: 4)
total_compressed = result.sum { |block| block[:data].bytesize }
total_original = result.sum { |block| block[:original_size] }

puts "Compressed #{result.size} blocks in parallel"
puts "Total compression: #{total_original} → #{total_compressed} bytes"

Cache efficiency affects compression performance significantly. Sequential memory access patterns enable CPU prefetching and cache line utilization. Random access patterns during dictionary lookups cause cache misses. Buffer sizing impacts cache behavior, with 4KB to 16KB buffers typically optimal for L1/L2 cache utilization.

Compression trade-offs balance multiple objectives: compression ratio, compression speed, decompression speed, and memory usage. No single algorithm optimizes all dimensions simultaneously. Applications select algorithms based on their specific performance requirements and resource constraints.

Common Patterns

The compress-once, decompress-many pattern optimizes for read-heavy workloads. Applications compress data during writes and store the compressed form, decompressing on each read. This trade-off reduces storage costs and network bandwidth at the expense of CPU during reads. Static content delivery and backup systems commonly apply this pattern.

require 'zlib'

# Compressed cache implementation
class CompressedCache
  def initialize
    @cache = {}
  end
  
  def set(key, value)
    compressed = Zlib::Deflate.deflate(value, Zlib::BEST_COMPRESSION)
    @cache[key] = {
      data: compressed,
      original_size: value.bytesize,
      compressed_size: compressed.bytesize
    }
  end
  
  def get(key)
    entry = @cache[key]
    return nil unless entry
    
    Zlib::Inflate.inflate(entry[:data])
  end
  
  def stats
    total_original = @cache.values.sum { |e| e[:original_size] }
    total_compressed = @cache.values.sum { |e| e[:compressed_size] }
    
    {
      entries: @cache.size,
      original_size: total_original,
      compressed_size: total_compressed,
      ratio: total_compressed.to_f / total_original
    }
  end
end

cache = CompressedCache.new
cache.set('doc1', "Large document " * 1000)
cache.set('doc2', "Another document " * 1000)

puts "Retrieved: #{cache.get('doc1')[0..50]}..."
puts "Cache stats: #{cache.stats}"

Transparent compression integrates compression into I/O operations invisibly to application code. Wrapper classes intercept read and write operations, applying compression automatically. This pattern simplifies application code while providing compression benefits without modifying business logic.

require 'zlib'

# Transparent compression for file operations
class CompressedFile
  def initialize(path, mode)
    @path = path
    @mode = mode
    @file = nil
    @compressor = nil
  end
  
  def open
    if @mode.include?('w')
      @file = File.open(@path, 'wb')
      @compressor = Zlib::GzipWriter.new(@file)
    else
      @file = File.open(@path, 'rb')
      @compressor = Zlib::GzipReader.new(@file)
    end
  end
  
  def write(data)
    @compressor.write(data)
  end
  
  def read(length = nil)
    @compressor.read(length)
  end
  
  def close
    @compressor.close
    @file.close
  end
  
  def self.open(path, mode)
    file = new(path, mode)
    file.open
    
    if block_given?
      begin
        yield file
      ensure
        file.close
      end
    else
      file
    end
  end
end

# Usage identical to regular File operations
CompressedFile.open('data.gz', 'w') do |f|
  f.write("Transparently compressed data\n")
  f.write("Application code unchanged\n")
end

CompressedFile.open('data.gz', 'r') do |f|
  puts f.read
end

Adaptive compression selects algorithms or parameters based on data characteristics detected during processing. Analysis of initial data blocks determines optimal compression strategy. This pattern maximizes compression effectiveness across heterogeneous data sets.

require 'zlib'

# Adaptive compression based on data analysis
class AdaptiveCompressor
  SAMPLE_SIZE = 4096
  
  def self.analyze_data(data)
    sample = data[0, SAMPLE_SIZE]
    unique_bytes = sample.bytes.uniq.size
    
    # Calculate entropy estimate
    frequencies = Hash.new(0)
    sample.each_byte { |b| frequencies[b] += 1 }
    
    entropy = frequencies.values.reduce(0.0) do |sum, freq|
      p = freq.to_f / sample.bytesize
      sum - (p * Math.log2(p))
    end
    
    {
      entropy: entropy,
      unique_bytes: unique_bytes,
      compressibility: entropy < 6.0 ? :high : :low
    }
  end
  
  def self.compress(data)
    analysis = analyze_data(data)
    
    level = case analysis[:compressibility]
    when :high
      Zlib::BEST_COMPRESSION
    when :low
      Zlib::BEST_SPEED
    end
    
    compressed = Zlib::Deflate.deflate(data, level)
    
    {
      compressed: compressed,
      analysis: analysis,
      level: level,
      ratio: compressed.bytesize.to_f / data.bytesize
    }
  end
end

# Test with different data types
text_data = "Repeated text " * 1000
random_data = 10000.times.map { rand(256) }.pack('C*')

text_result = AdaptiveCompressor.compress(text_data)
random_result = AdaptiveCompressor.compress(random_data)

puts "Text data:"
puts "  Entropy: #{text_result[:analysis][:entropy].round(2)}"
puts "  Level: #{text_result[:level]}"
puts "  Ratio: #{(text_result[:ratio] * 100).round(2)}%"

puts "\nRandom data:"
puts "  Entropy: #{random_result[:analysis][:entropy].round(2)}"
puts "  Level: #{random_result[:level]}"
puts "  Ratio: #{(random_result[:ratio] * 100).round(2)}%"

Delta compression encodes differences between versions rather than complete content. Version control systems and backup applications use delta compression to minimize storage when data changes incrementally. The technique computes differences between old and new versions, compressing the delta stream.

require 'zlib'
require 'json'

# Simple delta compression for versioned data
class DeltaCompressor
  def self.compute_delta(old_data, new_data)
    # Simple byte-level delta (production systems use sophisticated diff algorithms)
    delta = []
    max_length = [old_data.bytesize, new_data.bytesize].max
    
    (0...max_length).each do |i|
      old_byte = i < old_data.bytesize ? old_data.getbyte(i) : nil
      new_byte = i < new_data.bytesize ? new_data.getbyte(i) : nil
      
      if old_byte != new_byte
        delta << { position: i, old: old_byte, new: new_byte }
      end
    end
    
    delta
  end
  
  def self.compress_delta(delta)
    # Compress the delta information
    delta_json = JSON.generate(delta)
    Zlib::Deflate.deflate(delta_json, Zlib::BEST_COMPRESSION)
  end
  
  def self.apply_delta(old_data, compressed_delta)
    delta_json = Zlib::Inflate.inflate(compressed_delta)
    delta = JSON.parse(delta_json)
    
    # Work on raw bytes so setbyte and appends behave predictably
    result = old_data.dup.force_encoding(Encoding::BINARY)
    delta.each do |change|
      pos = change['position']
      new_byte = change['new']
      
      if new_byte.nil?
        # New version is shorter: drop bytes from the first removal onward
        result = result.byteslice(0, pos)
      elsif pos < result.bytesize
        result.setbyte(pos, new_byte)
      else
        # New version is longer: append the extra byte
        result << new_byte.chr
      end
    end
    
    result
  end
end

Tools & Ecosystem

The Zlib library provides the foundation for compression in Ruby, implementing DEFLATE compression used by GZIP and ZIP formats. Zlib ships with Ruby as a standard library, requiring no additional installation. The library supports both one-shot and streaming compression with configurable compression levels and memory usage.

The Ruby gem ecosystem includes specialized compression libraries for various formats. The bzip2-ruby gem implements bzip2 compression, achieving higher compression ratios than DEFLATE at the cost of slower compression speed. The lz4-ruby gem provides LZ4 compression, optimizing for decompression speed over compression ratio.

require 'zlib'

# Comparing compression libraries
class CompressionBenchmark
  def self.compare(data)
    results = {}
    
    # Zlib (DEFLATE)
    time_start = Time.now
    zlib_compressed = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
    zlib_time = Time.now - time_start
    
    results[:zlib] = {
      size: zlib_compressed.bytesize,
      time: zlib_time,
      ratio: zlib_compressed.bytesize.to_f / data.bytesize
    }
    
    results
  end
end

sample_data = File.read('/usr/share/dict/words')
results = CompressionBenchmark.compare(sample_data)

results.each do |lib, stats|
  puts "#{lib.capitalize}:"
  puts "  Compressed size: #{stats[:size]} bytes"
  puts "  Ratio: #{(stats[:ratio] * 100).round(2)}%"
  puts "  Time: #{(stats[:time] * 1000).round(2)}ms"
end

Archive manipulation tools handle ZIP, TAR, and other archive formats. The rubyzip gem provides comprehensive ZIP file support, including creation, extraction, and modification of ZIP archives. The minitar gem implements TAR archive handling, commonly combined with GZIP compression for TGZ files.

Command-line integration connects Ruby programs with system compression utilities. The gzip, bzip2, and xz commands provide high-performance compression through system calls. Ruby programs invoke these utilities via backticks, system calls, or Open3 for advanced I/O handling.

require 'open3'

# Using system compression utilities
class SystemCompressor
  def self.gzip_compress(input_file, output_file)
    stdout, stderr, status = Open3.capture3(
      'gzip', '-9', '-c', input_file, binmode: true  # binary-safe capture
    )
    
    if status.success?
      File.write(output_file, stdout, mode: 'wb')
      { success: true, size: stdout.bytesize }
    else
      { success: false, error: stderr }
    end
  end
  
  def self.compare_compression(file_path)
    original_size = File.size(file_path)
    results = {}
    
    ['gzip', 'bzip2', 'xz'].each do |utility|
      temp_output = "/tmp/compressed.#{utility}"
      
      stdout, stderr, status = Open3.capture3(
        utility, '-9', '-c', file_path, binmode: true
      )
      
      if status.success?
        File.write(temp_output, stdout, mode: 'wb')
        compressed_size = File.size(temp_output)
        
        results[utility] = {
          size: compressed_size,
          ratio: compressed_size.to_f / original_size
        }
        
        File.delete(temp_output)
      end
    end
    
    results
  end
end

HTTP compression middleware applies transparent compression to web responses. The Rack::Deflater middleware compresses HTTP responses using GZIP when clients indicate support through Accept-Encoding headers. This reduces bandwidth consumption for web applications without modifying application code.

Database compression reduces storage requirements for large datasets. PostgreSQL supports TOAST compression for large field values. Ruby applications leverage database compression through standard database adapters, with compression occurring transparently at the database layer.

Reference

Compression Algorithm Comparison

Algorithm   Type      Ratio     Speed      Use Case
DEFLATE     Lossless  2-3x      Medium     General purpose
LZ4         Lossless  1.5-2x    Fast       Real-time systems
BZIP2       Lossless  3-4x      Slow       Maximum compression
LZMA        Lossless  3-5x      Very slow  Archives
Snappy      Lossless  1.5-2x    Very fast  Streaming data
Zstandard   Lossless  2-3x      Fast       Modern applications
RLE         Lossless  Variable  Very fast  Simple repetitive data

Ruby Zlib Constants

Constant             Value  Description
NO_COMPRESSION       0      No compression
BEST_SPEED           1      Fast compression
DEFAULT_COMPRESSION  -1     Default level (6)
BEST_COMPRESSION     9      Maximum compression
FILTERED             1      Filtered data strategy
HUFFMAN_ONLY         2      Huffman coding only
RLE                  3      Run-length encoding
FIXED                4      Fixed Huffman codes

The first four are compression levels; FILTERED through FIXED are strategy constants passed as a separate argument to Zlib::Deflate.new.

Compression Level Effects

Level  Ratio Impact          Speed Impact  Typical Use
0      No compression        Fastest       Testing, already compressed
1-3    Low compression       Fast          Real-time streaming
4-6    Moderate compression  Balanced      General purpose
7-9    High compression      Slow          Archives, storage

Common File Format Headers

Format  Magic Bytes        Extension  Description
GZIP    1f 8b              .gz        DEFLATE compression
ZIP     50 4b 03 04        .zip       Archive with DEFLATE
BZIP2   42 5a 68           .bz2       Burrows-Wheeler compression
XZ      fd 37 7a 58 5a 00  .xz        LZMA2 compression
ZSTD    28 b5 2f fd        .zst       Zstandard compression

Zlib Stream Classes

Class             Purpose               Usage Pattern
Zlib::Deflate     Compression stream    Create, write data, finish
Zlib::Inflate     Decompression stream  Create, write compressed, finish
Zlib::GzipWriter  GZIP file writer      Open, write, close
Zlib::GzipReader  GZIP file reader      Open, read, close
Zlib::ZStream     Base stream class     Not instantiated directly

Memory Requirements by Algorithm

Algorithm  Window Size  Dictionary Size  Typical Memory
DEFLATE    32 KB        32 KB            256-384 KB
LZ4        64 KB        None             64 KB
BZIP2      900 KB       None             3-4 MB
LZMA       1-16 MB      1-16 MB          2-32 MB

Error Handling Patterns

Error Type         Detection           Recovery Strategy
Corrupt data       Checksum mismatch   Reject decompression
Truncated stream   Premature EOF       Request retransmission
Memory exhaustion  Allocation failure  Reduce buffer size
Invalid format     Magic byte check    Identify correct format
Version mismatch   Header version      Update library or convert

Performance Optimization Checklist

Optimization                 Impact  Implementation
Choose appropriate level     High    Match level to use case
Use streaming API            High    Process large files incrementally
Enable parallel compression  High    Split into independent blocks
Reuse compressor objects     Medium  Avoid repeated initialization
Tune buffer sizes            Medium  Align with cache lines
Pre-allocate buffers         Low     Reduce allocation overhead
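The "reuse compressor objects" item can be sketched with the stream API: Zlib::ZStream#reset reinitializes a deflater so one instance serves many inputs (a sketch assuming CRuby's bundled zlib):

```ruby
require 'zlib'

# One Zlib::Deflate instance reused across inputs via reset,
# avoiding per-call allocation of the deflate state
deflater = Zlib::Deflate.new(Zlib::BEST_SPEED)

outputs = ['first payload', 'second payload'].map do |chunk|
  compressed = deflater.deflate(chunk, Zlib::FINISH)  # complete zlib stream
  deflater.reset                                      # ready for the next input
  compressed
end

# Each output is an independent zlib stream
outputs.each { |c| puts Zlib::Inflate.inflate(c) }
# first payload
# second payload
```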

Common Operations Quick Reference

# Compress string
Zlib::Deflate.deflate(string, level)

# Decompress string
Zlib::Inflate.inflate(compressed)

# Compress file
Zlib::GzipWriter.open(path) { |gz| gz.write(data) }

# Decompress file
Zlib::GzipReader.open(path) { |gz| gz.read }

# Calculate checksum
Zlib.crc32(data)
Zlib.adler32(data)

# Streaming compression
deflater = Zlib::Deflate.new(level)
deflater << chunk
compressed = deflater.finish