Overview
Data compression reduces the size of data by encoding information more efficiently. The process transforms data from its original representation into a form that requires fewer bits while preserving the ability to reconstruct the original data, either exactly or approximately. Compression algorithms operate on the principle that most data contains redundancy or patterns that can be represented more compactly.
Two fundamental categories define compression approaches: lossless compression preserves all original data and allows perfect reconstruction, while lossy compression discards less important information to achieve higher compression ratios. Lossless compression applies to scenarios requiring exact data recovery, such as text files, executables, and archives. Lossy compression suits applications where approximate reconstruction suffices, such as images, audio, and video.
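The lossy branch can be illustrated with a toy uniform quantizer (hypothetical sample values; real lossy codecs use perceptual models rather than a fixed step size):

```ruby
# Uniform quantization: a minimal lossy "compression" step.
# Dividing by a step size and rounding discards fine detail,
# so reconstruction is only approximate.
def quantize(samples, step)
  samples.map { |s| (s.to_f / step).round }
end

def dequantize(codes, step)
  codes.map { |c| c * step }
end

samples  = [12, 14, 15, 97, 99, 101]
codes    = quantize(samples, 5)    # fewer distinct values => compresses better
restored = dequantize(codes, 5)
puts codes.inspect                 # => [2, 3, 3, 19, 20, 20]
puts restored.inspect              # => [10, 15, 15, 95, 100, 100]
```

The restored values are close to, but not identical to, the originals: that lost precision is the price paid for the smaller symbol alphabet.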
Compression effectiveness depends on the entropy of the source data. Entropy measures the average information content per symbol in a message. Data with high redundancy contains low entropy and compresses well, while random or already-compressed data contains high entropy and resists further compression. This relationship defines theoretical compression limits described by Shannon's source coding theorem.
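A minimal sketch of the entropy calculation, applying H(X) = -Σ p(x) log₂ p(x) to symbol frequencies:

```ruby
# Shannon entropy in bits per symbol: H(X) = -sum p(x) * log2 p(x)
def shannon_entropy(data)
  counts = Hash.new(0)
  data.each_char { |ch| counts[ch] += 1 }
  total = data.length.to_f
  counts.values.sum(0.0) { |n| p = n / total; -p * Math.log2(p) }
end

puts shannon_entropy("aaaa")            # => 0.0  (fully redundant)
puts shannon_entropy("abcd")            # => 2.0  (uniform over 4 symbols)
puts shannon_entropy("aaab").round(3)   # => 0.811 (skewed distribution)
```

A string of identical symbols carries zero bits per symbol, while a uniform distribution over four symbols needs the full two bits each.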
# Basic compression demonstration
require 'zlib'
original = "aaaaaaabbbbbbcccccc"
compressed = Zlib::Deflate.deflate(original)
puts "Original size: #{original.bytesize} bytes"
puts "Compressed size: #{compressed.bytesize} bytes"
puts "Compression ratio: #{((1 - compressed.bytesize.to_f / original.bytesize) * 100).round(2)}%"
# Original size: 19 bytes
# Compressed size: 15 bytes (exact size may vary by zlib version)
# Compression ratio: 21.05%
The compression pipeline typically consists of multiple stages: analysis of input data characteristics, application of transformation to expose redundancy, encoding of transformed data using efficient representation, and optional post-processing for storage or transmission. Each stage contributes to overall compression effectiveness.
Hardware and software implementations differ significantly in performance characteristics. Hardware implementations achieve high throughput for specific algorithms through dedicated circuits, while software implementations provide flexibility across diverse platforms and data types. Modern processors include specialized instructions for compression operations, such as CRC calculation and bit manipulation, improving software compression performance.
Key Principles
Redundancy elimination forms the foundation of compression. Redundancy manifests in multiple forms: statistical redundancy occurs when symbols appear with non-uniform probability, spatial redundancy exists when neighboring data elements correlate, and temporal redundancy appears when data patterns repeat over time. Compression algorithms exploit these redundancies through different mechanisms.
Statistical encoding assigns shorter codes to frequently occurring symbols and longer codes to rare symbols. Huffman coding generates optimal prefix-free codes based on symbol frequencies, where no code serves as a prefix for another code. Arithmetic coding achieves better compression for sources with skewed probability distributions by encoding entire messages as single numbers in the interval [0, 1).
# Statistical encoding example using frequency analysis
def analyze_frequencies(data)
frequencies = Hash.new(0)
data.each_char { |char| frequencies[char] += 1 }
frequencies.sort_by { |_, count| -count }
end
text = "compression uses frequency analysis"
frequencies = analyze_frequencies(text)
puts "Character frequencies (most common first):"
frequencies.take(5).each do |char, count|
puts " '#{char}': #{count} times"
end
# Character frequencies (most common first):
# 's': 6 times
# 'e': 4 times
# 'n': 3 times
# ' ': 3 times
# 'c': 2 times
# (ordering among equal counts may vary)
Dictionary-based compression replaces repeated sequences with references to a dictionary containing previously seen data. LZ77 maintains a sliding window of recent data and encodes matches as (distance, length) pairs. LZ78 builds an explicit dictionary of sequences encountered during compression, adding new entries as unique patterns appear. LZW extends LZ78 by starting with a predefined dictionary of single symbols and dynamically expanding it during compression.
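A minimal LZW encoder sketch illustrates the dictionary-growth idea; for clarity it is byte-oriented and emits integer codes rather than packed variable-width bits:

```ruby
# Minimal LZW encoder: starts with single-byte entries (codes 0-255)
# and learns each new sequence the first time it appears.
def lzw_encode(input)
  dict = {}
  (0..255).each { |i| dict[i.chr] = i }
  next_code = 256
  output = []
  current = ''
  input.each_char do |ch|
    candidate = current + ch
    if dict.key?(candidate)
      current = candidate           # extend the current match
    else
      output << dict[current]       # emit code for the longest known prefix
      dict[candidate] = next_code   # learn the new sequence
      next_code += 1
      current = ch
    end
  end
  output << dict[current] unless current.empty?
  output
end

puts lzw_encode("ababab").inspect   # => [97, 98, 256, 256]
```

Note how the second half of the input collapses into two dictionary references (code 256, "ab"), which is exactly where the compression comes from.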
Transform coding reorganizes data to concentrate information in fewer coefficients. The Discrete Cosine Transform (DCT) converts spatial domain data into frequency domain representation, where most energy concentrates in low-frequency coefficients. Quantization then discards high-frequency coefficients with minimal perceptual impact, achieving compression through controlled information loss.
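The energy-compaction effect can be demonstrated with a naive O(N²) DCT-II on a slowly varying signal (illustrative values; real codecs apply fast transforms to fixed-size blocks):

```ruby
# Naive DCT-II: X[k] = sum over n of x[n] * cos(pi/N * (n + 0.5) * k)
def dct(signal)
  n = signal.size
  (0...n).map do |k|
    (0...n).sum { |i| signal[i] * Math.cos(Math::PI / n * (i + 0.5) * k) }
  end
end

smooth = [8.0, 8.0, 7.0, 7.0, 6.0, 6.0, 5.0, 5.0]   # slowly varying signal
dct(smooth).each_with_index do |coeff, k|
  puts "X[#{k}] = #{coeff.round(3)}"
end
# X[0] (the mean term, 52.0) dominates; the high-frequency coefficients are
# near zero, so a lossy coder can quantize or drop them cheaply.
```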
Run-length encoding handles sequences of repeated values by storing the value and repetition count. This simple technique works effectively for data with long runs of identical values, such as binary images or sparse arrays. The compression ratio depends directly on the average run length in the input data.
# Run-length encoding implementation
def run_length_encode(data)
return [] if data.empty?
encoded = []
current_value = data[0]
count = 1
data[1..-1].each do |value|
if value == current_value
count += 1
else
encoded << [current_value, count]
current_value = value
count = 1
end
end
encoded << [current_value, count]
encoded
end
data = [1, 1, 1, 2, 2, 3, 3, 3, 3]
encoded = run_length_encode(data)
puts "Original: #{data.inspect}"
puts "Encoded: #{encoded.inspect}"
# Original: [1, 1, 1, 2, 2, 3, 3, 3, 3]
# Encoded: [[1, 3], [2, 2], [3, 4]]
Entropy encoding converts symbols into bit sequences based on probability distributions. Optimal codes satisfy the Kraft inequality, ensuring decodability. Symbol probabilities determine the minimum average code length through the entropy formula: H(X) = -Σ p(x) log₂ p(x). For stationary sources, the achieved average code length approaches but never drops below this theoretical bound.
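The Kraft inequality — Σ 2^(-lᵢ) ≤ 1 over the code lengths lᵢ — can be checked mechanically; a small sketch:

```ruby
# Kraft sum for a set of binary code lengths: <= 1 is necessary for a
# prefix-free code, and == 1 means no bit patterns are wasted.
def kraft_sum(lengths)
  lengths.sum { |l| 2.0**-l }
end

complete = [1, 2, 3, 3]   # e.g. codes 0, 10, 110, 111
wasteful = [2, 2, 3]      # leaves codewords unused
invalid  = [1, 1, 2]      # cannot form a prefix-free code

puts kraft_sum(complete)  # => 1.0
puts kraft_sum(wasteful)  # => 0.625
puts kraft_sum(invalid)   # => 1.25 (> 1, so no such prefix code exists)
```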
Context modeling improves compression by adapting probability estimates based on previously encoded symbols. Higher-order models consider longer context windows, providing better predictions at the cost of increased memory and computation. Adaptive models update probabilities during encoding, eliminating the need to transmit probability tables separately.
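A sketch of an order-1 adaptive model (hypothetical class name; counts are conditioned on the single previous symbol, with add-one smoothing over a 256-symbol byte alphabet):

```ruby
# Order-1 adaptive context model: estimates P(symbol | previous symbol)
# from running counts, so predictions sharpen as encoding proceeds and
# no probability table needs to be transmitted.
class Order1Model
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  end

  # Add-one smoothing keeps unseen symbols from getting zero probability.
  def probability(context, symbol)
    seen = @counts[context]
    (seen[symbol] + 1.0) / (seen.values.sum + 256)
  end

  def update(context, symbol)
    @counts[context][symbol] += 1
  end
end

model = Order1Model.new
"abababab".each_char.each_cons(2) { |prev, cur| model.update(prev, cur) }
puts model.probability('a', 'b')   # 'b' followed 'a' 4 times => ~0.019
puts model.probability('a', 'c')   # never observed          => ~0.004
```

After training on the alternating string, the model rates 'b' after 'a' five times more likely than any unseen symbol; an arithmetic coder would translate that skew directly into shorter output.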
Prediction algorithms estimate the next symbol based on preceding symbols, then encode the prediction error rather than the original value. Linear predictors compute weighted sums of previous samples, while nonlinear predictors handle more complex dependencies. Prediction reduces effective entropy when correlations exist between adjacent symbols.
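A first-order predictor that guesses each sample equals the previous one turns a slowly varying sequence into small residuals; a minimal sketch:

```ruby
# Encode residuals against an order-1 predictor (the previous sample).
def predictive_encode(samples)
  prev = 0
  samples.map { |s| r = s - prev; prev = s; r }
end

# Decoding is a running sum of the residuals.
def predictive_decode(residuals)
  prev = 0
  residuals.map { |r| prev += r }
end

samples   = [100, 102, 101, 104, 104, 105]
residuals = predictive_encode(samples)
puts residuals.inspect                          # => [100, 2, -1, 3, 0, 1]
puts(predictive_decode(residuals) == samples)   # => true
```

The residual stream clusters around zero, so a subsequent entropy coder sees a much more skewed (lower-entropy) distribution than the raw samples.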
Ruby Implementation
Ruby provides compression through the Zlib library, which implements DEFLATE compression combining LZ77 and Huffman coding. The Zlib module offers both stream-based and one-shot compression interfaces. Stream-based compression processes data incrementally, suitable for large files or network streams, while one-shot methods compress entire strings in memory.
require 'zlib'
# One-shot compression
original_data = "The quick brown fox jumps over the lazy dog" * 100
compressed = Zlib::Deflate.deflate(original_data, Zlib::BEST_COMPRESSION)
decompressed = Zlib::Inflate.inflate(compressed)
puts "Original size: #{original_data.bytesize}"
puts "Compressed size: #{compressed.bytesize}"
puts "Compression ratio: #{compressed.bytesize.to_f / original_data.bytesize}"
puts "Data matches: #{original_data == decompressed}"
# Original size: 4300
# Compressed size: 54
# Compression ratio: 0.012558139534883721
# Data matches: true
The Zlib::Deflate class provides control over compression levels ranging from 0 (no compression) to 9 (maximum compression). Higher compression levels increase CPU usage while producing smaller output. Level selection depends on the trade-off between compression speed and output size for specific applications.
require 'zlib'
require 'benchmark'
data = File.read('/usr/share/dict/words')
Benchmark.bm(20) do |x|
[Zlib::NO_COMPRESSION, Zlib::BEST_SPEED, Zlib::DEFAULT_COMPRESSION,
Zlib::BEST_COMPRESSION].each do |level|
x.report("Level #{level}:") do
compressed = Zlib::Deflate.deflate(data, level)
puts " Size: #{compressed.bytesize} bytes"
end
end
end
Stream-based compression handles data too large for memory by processing chunks incrementally. The GzipWriter and GzipReader classes wrap IO objects, transparently compressing or decompressing data during write and read operations. This approach integrates naturally with file and network I/O.
require 'zlib'
# Stream-based file compression
def compress_file(input_path, output_path)
Zlib::GzipWriter.open(output_path) do |gz|
File.open(input_path, 'rb') do |file|
while chunk = file.read(8192)
gz.write(chunk)
end
end
end
end
# Stream-based file decompression
def decompress_file(input_path, output_path)
Zlib::GzipReader.open(input_path) do |gz|
File.open(output_path, 'wb') do |file|
while chunk = gz.read(8192)
file.write(chunk)
end
end
end
end
compress_file('large_file.txt', 'large_file.txt.gz')
decompress_file('large_file.txt.gz', 'restored_file.txt')
The StringIO class enables in-memory stream operations, useful for testing or processing string data with compression APIs designed for streams. This approach combines the flexibility of stream-based processing with the performance of in-memory operations.
require 'zlib'
require 'stringio'
# Compress string data using stream interface
input = "Repeated data " * 1000
output = StringIO.new
Zlib::GzipWriter.wrap(output) do |gz|
gz.write(input)
end
compressed = output.string
puts "Input size: #{input.bytesize}"
puts "Compressed size: #{compressed.bytesize}"
# Input size: 14000
# Compressed size: 43
Custom compression implementations demonstrate algorithm fundamentals. A basic Huffman encoder builds a frequency table, constructs an optimal prefix tree, generates codes, and encodes the input. This implementation serves educational purposes rather than production use.
# Simplified Huffman coding implementation
class HuffmanNode
attr_accessor :char, :frequency, :left, :right
def initialize(char, frequency, left = nil, right = nil)
@char = char
@frequency = frequency
@left = left
@right = right
end
def leaf?
@left.nil? && @right.nil?
end
end
def build_huffman_tree(frequencies)
nodes = frequencies.map { |char, freq| HuffmanNode.new(char, freq) }
while nodes.size > 1
nodes.sort_by!(&:frequency)
left = nodes.shift
right = nodes.shift
parent = HuffmanNode.new(nil, left.frequency + right.frequency, left, right)
nodes << parent
end
nodes.first
end
def generate_codes(node, prefix = '', codes = {})
return codes if node.nil?
if node.leaf?
codes[node.char] = prefix
else
generate_codes(node.left, prefix + '0', codes)
generate_codes(node.right, prefix + '1', codes)
end
codes
end
text = "huffman encoding example"
frequencies = Hash.new(0)
text.each_char { |char| frequencies[char] += 1 }
tree = build_huffman_tree(frequencies)
codes = generate_codes(tree)
puts "Huffman codes:"
codes.sort_by { |_, code| code.length }.each do |char, code|
puts " '#{char}': #{code}"
end
Practical Examples
Text file compression demonstrates common compression workflow patterns. Applications typically compress configuration files, logs, and documentation to reduce storage requirements and transfer times. The compression ratio depends on text redundancy, with natural language text typically achieving 2:1 to 3:1 compression.
require 'zlib'
# Compress and decompress text files
class TextCompressor
def self.compress_file(input_file, output_file)
input_text = File.read(input_file)
compressed = Zlib::Deflate.deflate(input_text, Zlib::BEST_COMPRESSION)
File.open(output_file, 'wb') do |f|
f.write([input_text.bytesize].pack('Q')) # Store original size
f.write(compressed)
end
{
original_size: input_text.bytesize,
compressed_size: compressed.bytesize + 8,
ratio: compressed.bytesize.to_f / input_text.bytesize
}
end
def self.decompress_file(input_file, output_file)
File.open(input_file, 'rb') do |f|
original_size = f.read(8).unpack1('Q')
compressed = f.read
decompressed = Zlib::Inflate.inflate(compressed)
if decompressed.bytesize != original_size
raise "Size mismatch: expected #{original_size}, got #{decompressed.bytesize}"
end
File.write(output_file, decompressed)
end
end
end
# Example usage
stats = TextCompressor.compress_file('document.txt', 'document.txt.zz')
puts "Compressed: #{stats[:original_size]} → #{stats[:compressed_size]} bytes"
puts "Ratio: #{(stats[:ratio] * 100).round(2)}%"
JSON data compression applies to API responses and configuration files containing structured data. JSON text contains significant redundancy through repeated key names, whitespace, and structural patterns. Compression before transmission reduces bandwidth consumption.
require 'zlib'
require 'json'
require 'base64'
# Compress JSON API responses
class JsonCompressor
def self.compress(data)
json_string = JSON.generate(data)
compressed = Zlib::Deflate.deflate(json_string)
Base64.strict_encode64(compressed)
end
def self.decompress(compressed_string)
compressed = Base64.strict_decode64(compressed_string)
json_string = Zlib::Inflate.inflate(compressed)
JSON.parse(json_string)
end
end
# Sample data with redundancy
users = 100.times.map do |i|
{
user_id: i,
username: "user#{i}",
email: "user#{i}@example.com",
status: "active",
created_at: "2024-01-01T00:00:00Z"
}
end
original_json = JSON.generate(users)
compressed = JsonCompressor.compress(users)
puts "Original JSON: #{original_json.bytesize} bytes"
puts "Compressed: #{compressed.bytesize} bytes"
puts "Ratio: #{(compressed.bytesize.to_f / original_json.bytesize * 100).round(2)}%"
# Verify round-trip (JSON.parse returns string keys, so compare serialized forms)
restored = JsonCompressor.decompress(compressed)
puts "Data integrity: #{JSON.generate(users) == JSON.generate(restored)}"
Binary data compression handles executable files, images, and other non-text formats. Binary data often contains less redundancy than text, resulting in lower compression ratios. Compression algorithms must handle arbitrary byte sequences without assuming text encoding.
require 'zlib'
# Binary file compression with integrity checking
class BinaryCompressor
def self.compress_with_checksum(input_data)
checksum = Zlib.crc32(input_data)
compressed = Zlib::Deflate.deflate(input_data, Zlib::BEST_COMPRESSION)
# Store checksum and compressed data
[checksum].pack('L') + compressed
end
def self.decompress_with_verification(compressed_data)
stored_checksum = compressed_data[0, 4].unpack1('L')
compressed = compressed_data[4..-1]
decompressed = Zlib::Inflate.inflate(compressed)
calculated_checksum = Zlib.crc32(decompressed)
if stored_checksum != calculated_checksum
raise "Checksum mismatch: data corruption detected"
end
decompressed
end
end
# Example with binary data
binary_data = 10000.times.map { rand(256) }.pack('C*')
compressed = BinaryCompressor.compress_with_checksum(binary_data)
puts "Original: #{binary_data.bytesize} bytes"
puts "Compressed: #{compressed.bytesize} bytes"
restored = BinaryCompressor.decompress_with_verification(compressed)
puts "Verification: #{binary_data == restored}"
Archive creation combines multiple files into a single compressed container. This pattern appears in backup systems, software distribution, and file transfer utilities. The archive format stores file metadata alongside compressed content.
require 'zlib'
require 'json'
# Simple archive format for multiple files
class SimpleArchive
def self.create(output_file, files)
File.open(output_file, 'wb') do |archive|
manifest = []
files.each do |input_file|
data = File.read(input_file, mode: 'rb')
compressed = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
manifest << {
name: File.basename(input_file),
original_size: data.bytesize,
compressed_size: compressed.bytesize,
offset: archive.pos
}
archive.write(compressed)
end
# Write manifest at end
manifest_json = JSON.generate(manifest)
archive.write([manifest_json.bytesize].pack('Q'))
archive.write(manifest_json)
end
end
def self.extract(archive_file, output_dir)
File.open(archive_file, 'rb') do |archive|
# Read manifest from end
archive.seek(-8, IO::SEEK_END)
manifest_size = archive.read(8).unpack1('Q')
archive.seek(-(8 + manifest_size), IO::SEEK_END)
manifest = JSON.parse(archive.read(manifest_size))
# Extract each file
manifest.each do |entry|
archive.seek(entry['offset'])
compressed = archive.read(entry['compressed_size'])
data = Zlib::Inflate.inflate(compressed)
output_path = File.join(output_dir, entry['name'])
File.write(output_path, data, mode: 'wb')
puts "Extracted: #{entry['name']} (#{data.bytesize} bytes)"
end
end
end
end
Performance Considerations
Compression speed depends on algorithm complexity and data characteristics. Dictionary-based algorithms like DEFLATE perform more work per byte than simple schemes like run-length encoding. The compression level parameter trades CPU time for output size, with higher levels spending substantially more computation for diminishing size reductions.
Memory usage varies significantly between algorithms. Streaming algorithms process fixed-size windows, maintaining constant memory requirements regardless of input size. Dictionary-based methods require memory proportional to window size, typically 32KB for DEFLATE. Block-based algorithms may buffer entire blocks before processing.
Decompression generally executes faster than compression because it performs lookup operations rather than search operations. Dictionary-based decompression reads backward references and copies data, avoiding the pattern matching required during compression. This asymmetry makes compression suitable for write-once, read-many scenarios.
require 'zlib'
require 'benchmark'
# Compare compression levels
data = File.read('/usr/share/dict/words')
results = {}
Benchmark.bm(30) do |x|
(0..9).each do |level|
time = x.report("Compression level #{level}:") do
compressed = Zlib::Deflate.deflate(data, level)
results[level] = {
size: compressed.bytesize,
ratio: compressed.bytesize.to_f / data.bytesize
}
end
end
compressed = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
x.report("Decompression:") do
5.times { Zlib::Inflate.inflate(compressed) }
end
end
puts "\nSize analysis:"
results.each do |level, stats|
puts "Level #{level}: #{stats[:size]} bytes (#{(stats[:ratio] * 100).round(2)}%)"
end
Parallel compression improves throughput by processing independent data segments concurrently. Block-based formats divide input into chunks that compress independently, enabling multi-threaded compression without coordination overhead. The pigz utility, for example, parallelizes GZIP compression across blocks while producing output compatible with standard gzip.
require 'zlib'
require 'thread'
# Parallel compression of independent blocks
class ParallelCompressor
def self.compress(data, num_threads: 4, block_size: 1_048_576)
blocks = []
offset = 0
while offset < data.bytesize
blocks << data[offset, block_size]
offset += block_size
end
compressed_blocks = Array.new(blocks.size)
queue = Queue.new
blocks.each_with_index { |block, idx| queue << [block, idx] }
threads = num_threads.times.map do
Thread.new do
loop do
block, idx = queue.pop(true) # non-blocking pop raises ThreadError when drained
compressed_blocks[idx] = {
data: Zlib::Deflate.deflate(block, Zlib::BEST_SPEED),
original_size: block.bytesize
}
rescue ThreadError
break
end
end
end
threads.each(&:join)
compressed_blocks
end
end
# Example usage
large_data = "Sample text block " * 100000
result = ParallelCompressor.compress(large_data, num_threads: 4)
total_compressed = result.sum { |block| block[:data].bytesize }
total_original = result.sum { |block| block[:original_size] }
puts "Compressed #{result.size} blocks in parallel"
puts "Total compression: #{total_original} → #{total_compressed} bytes"
Cache efficiency affects compression performance significantly. Sequential memory access patterns enable CPU prefetching and cache line utilization. Random access patterns during dictionary lookups cause cache misses. Buffer sizing impacts cache behavior, with 4KB to 16KB buffers typically optimal for L1/L2 cache utilization.
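The chunk-size effect can be measured with a sketch like the following (timings vary by machine; the in-memory data is illustrative):

```ruby
require 'zlib'
require 'benchmark'

# Feed the same data through a streaming deflater in different chunk sizes.
data = "timestamped log entry with repeated structure\n" * 100_000

[4096, 16_384, 65_536].each do |chunk_size|
  out = +''
  seconds = Benchmark.realtime do
    deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION)
    offset = 0
    while offset < data.bytesize
      out << deflater.deflate(data.byteslice(offset, chunk_size))
      offset += chunk_size
    end
    out << deflater.finish
    deflater.close
  end
  puts "chunk #{chunk_size}: #{(seconds * 1000).round(1)} ms, #{out.bytesize} bytes"
end
```

The compressed output is identical regardless of chunk size; only the call overhead and cache behavior change, which is what the timing comparison exposes.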
Compression trade-offs balance multiple objectives: compression ratio, compression speed, decompression speed, and memory usage. No single algorithm optimizes all dimensions simultaneously. Applications select algorithms based on their specific performance requirements and resource constraints.
Common Patterns
The compress-once, decompress-many pattern optimizes for read-heavy workloads. Applications compress data during writes and store the compressed form, decompressing on each read. This trade-off reduces storage costs and network bandwidth at the expense of CPU during reads. Static content delivery and backup systems commonly apply this pattern.
require 'zlib'
# Compressed cache implementation
class CompressedCache
def initialize
@cache = {}
end
def set(key, value)
compressed = Zlib::Deflate.deflate(value, Zlib::BEST_COMPRESSION)
@cache[key] = {
data: compressed,
original_size: value.bytesize,
compressed_size: compressed.bytesize
}
end
def get(key)
entry = @cache[key]
return nil unless entry
Zlib::Inflate.inflate(entry[:data])
end
def stats
total_original = @cache.values.sum { |e| e[:original_size] }
total_compressed = @cache.values.sum { |e| e[:compressed_size] }
{
entries: @cache.size,
original_size: total_original,
compressed_size: total_compressed,
ratio: total_compressed.to_f / total_original
}
end
end
cache = CompressedCache.new
cache.set('doc1', "Large document " * 1000)
cache.set('doc2', "Another document " * 1000)
puts "Retrieved: #{cache.get('doc1')[0..50]}..."
puts "Cache stats: #{cache.stats}"
Transparent compression integrates compression into I/O operations invisibly to application code. Wrapper classes intercept read and write operations, applying compression automatically. This pattern simplifies application code while providing compression benefits without modifying business logic.
require 'zlib'
# Transparent compression for file operations
class CompressedFile
def initialize(path, mode)
@path = path
@mode = mode
@file = nil
@compressor = nil
end
def open
if @mode.include?('w')
@file = File.open(@path, 'wb')
@compressor = Zlib::GzipWriter.new(@file)
else
@file = File.open(@path, 'rb')
@compressor = Zlib::GzipReader.new(@file)
end
end
def write(data)
@compressor.write(data)
end
def read(length = nil)
@compressor.read(length)
end
def close
@compressor.close
@file.close
end
def self.open(path, mode)
file = new(path, mode)
file.open
if block_given?
begin
yield file
ensure
file.close
end
else
file
end
end
end
# Usage identical to regular File operations
CompressedFile.open('data.gz', 'w') do |f|
f.write("Transparently compressed data\n")
f.write("Application code unchanged\n")
end
CompressedFile.open('data.gz', 'r') do |f|
puts f.read
end
Adaptive compression selects algorithms or parameters based on data characteristics detected during processing. Analysis of initial data blocks determines optimal compression strategy. This pattern maximizes compression effectiveness across heterogeneous data sets.
require 'zlib'
# Adaptive compression based on data analysis
class AdaptiveCompressor
SAMPLE_SIZE = 4096
def self.analyze_data(data)
sample = data[0, SAMPLE_SIZE]
unique_bytes = sample.bytes.uniq.size
# Calculate entropy estimate
frequencies = Hash.new(0)
sample.each_byte { |b| frequencies[b] += 1 }
entropy = frequencies.values.reduce(0.0) do |sum, freq|
p = freq.to_f / sample.bytesize
sum - (p * Math.log2(p))
end
{
entropy: entropy,
unique_bytes: unique_bytes,
compressibility: entropy < 6.0 ? :high : :low
}
end
def self.compress(data)
analysis = analyze_data(data)
level = case analysis[:compressibility]
when :high
Zlib::BEST_COMPRESSION
when :low
Zlib::BEST_SPEED
end
compressed = Zlib::Deflate.deflate(data, level)
{
compressed: compressed,
analysis: analysis,
level: level,
ratio: compressed.bytesize.to_f / data.bytesize
}
end
end
# Test with different data types
text_data = "Repeated text " * 1000
random_data = 10000.times.map { rand(256) }.pack('C*')
text_result = AdaptiveCompressor.compress(text_data)
random_result = AdaptiveCompressor.compress(random_data)
puts "Text data:"
puts " Entropy: #{text_result[:analysis][:entropy].round(2)}"
puts " Level: #{text_result[:level]}"
puts " Ratio: #{(text_result[:ratio] * 100).round(2)}%"
puts "\nRandom data:"
puts " Entropy: #{random_result[:analysis][:entropy].round(2)}"
puts " Level: #{random_result[:level]}"
puts " Ratio: #{(random_result[:ratio] * 100).round(2)}%"
Delta compression encodes differences between versions rather than complete content. Version control systems and backup applications use delta compression to minimize storage when data changes incrementally. The technique computes differences between old and new versions, compressing the delta stream.
require 'zlib'
require 'json'
# Simple delta compression for versioned data
class DeltaCompressor
def self.compute_delta(old_data, new_data)
# Simple byte-level delta (production systems use sophisticated diff algorithms)
delta = []
max_length = [old_data.bytesize, new_data.bytesize].max
(0...max_length).each do |i|
old_byte = i < old_data.bytesize ? old_data.getbyte(i) : nil
new_byte = i < new_data.bytesize ? new_data.getbyte(i) : nil
if old_byte != new_byte
delta << { position: i, old: old_byte, new: new_byte }
end
end
delta
end
def self.compress_delta(delta)
# Compress the delta information
delta_json = JSON.generate(delta)
Zlib::Deflate.deflate(delta_json, Zlib::BEST_COMPRESSION)
end
def self.apply_delta(old_data, compressed_delta)
delta_json = Zlib::Inflate.inflate(compressed_delta)
delta = JSON.parse(delta_json)
result = old_data.dup.force_encoding(Encoding::BINARY) # byte-level edits
truncate_at = nil
delta.each do |change|
pos = change['position']
new_byte = change['new']
if new_byte.nil?
# No byte at this position in the new version: the result must shrink
truncate_at = pos if truncate_at.nil? || pos < truncate_at
elsif pos < result.bytesize
result.setbyte(pos, new_byte)
else
result << new_byte.chr # the new version grew past the old length
end
end
truncate_at ? result.byteslice(0, truncate_at) : result
end
end
Tools & Ecosystem
The Zlib library provides the foundation for compression in Ruby, implementing DEFLATE compression used by GZIP and ZIP formats. Zlib ships with Ruby as a standard library, requiring no additional installation. The library supports both one-shot and streaming compression with configurable compression levels and memory usage.
The Ruby gem ecosystem includes specialized compression libraries for additional formats. The bzip2-ruby gem exposes bzip2 compression, achieving higher compression ratios than DEFLATE at the cost of slower compression speed. The lz4-ruby gem provides LZ4 compression, optimizing for decompression speed over compression ratio.
require 'zlib'
# Comparing compression libraries
class CompressionBenchmark
def self.compare(data)
results = {}
# Zlib (DEFLATE)
time_start = Time.now
zlib_compressed = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
zlib_time = Time.now - time_start
results[:zlib] = {
size: zlib_compressed.bytesize,
time: zlib_time,
ratio: zlib_compressed.bytesize.to_f / data.bytesize
}
results
end
end
sample_data = File.read('/usr/share/dict/words')
results = CompressionBenchmark.compare(sample_data)
results.each do |lib, stats|
puts "#{lib.capitalize}:"
puts " Compressed size: #{stats[:size]} bytes"
puts " Ratio: #{(stats[:ratio] * 100).round(2)}%"
puts " Time: #{(stats[:time] * 1000).round(2)}ms"
end
Archive manipulation tools handle ZIP, TAR, and other archive formats. The rubyzip gem provides comprehensive ZIP file support, including creation, extraction, and modification of ZIP archives. The minitar gem implements TAR archive handling, commonly combined with GZIP compression for TGZ files.
Command-line integration connects Ruby programs with system compression utilities. The gzip, bzip2, and xz commands provide high-performance compression through system calls. Ruby programs invoke these utilities via backticks, system calls, or Open3 for advanced I/O handling.
require 'open3'
# Using system compression utilities
class SystemCompressor
def self.gzip_compress(input_file, output_file)
stdout, stderr, status = Open3.capture3(
'gzip', '-9', '-c', input_file, binmode: true
)
if status.success?
File.write(output_file, stdout, mode: 'wb')
{ success: true, size: stdout.bytesize }
else
{ success: false, error: stderr }
end
end
def self.compare_compression(file_path)
original_size = File.size(file_path)
results = {}
['gzip', 'bzip2', 'xz'].each do |utility|
temp_output = "/tmp/compressed.#{utility}"
stdout, stderr, status = Open3.capture3(
utility, '-9', '-c', file_path, binmode: true
)
if status.success?
File.write(temp_output, stdout, mode: 'wb')
compressed_size = File.size(temp_output)
results[utility] = {
size: compressed_size,
ratio: compressed_size.to_f / original_size
}
File.delete(temp_output)
end
end
results
end
end
HTTP compression middleware applies transparent compression to web responses. The Rack::Deflater middleware compresses HTTP responses using GZIP when clients indicate support through Accept-Encoding headers. This reduces bandwidth consumption for web applications without modifying application code.
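Enabling it is a one-line middleware declaration; a minimal config.ru sketch (assumes the rack gem is installed, and the handler app is illustrative):

```ruby
# config.ru -- Rack::Deflater compresses responses for clients that
# send an Accept-Encoding header listing gzip.
require 'rack'

use Rack::Deflater
run ->(env) { [200, { 'content-type' => 'text/plain' }, ['hello, compressed world']] }
```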
Database compression reduces storage requirements for large datasets. PostgreSQL supports TOAST compression for large field values. Ruby applications leverage database compression through standard database adapters, with compression occurring transparently at the database layer.
Reference
Compression Algorithm Comparison
| Algorithm | Type | Ratio | Speed | Use Case |
|---|---|---|---|---|
| DEFLATE | Lossless | 2-3x | Medium | General purpose |
| LZ4 | Lossless | 1.5-2x | Fast | Real-time systems |
| BZIP2 | Lossless | 3-4x | Slow | Maximum compression |
| LZMA | Lossless | 3-5x | Very slow | Archives |
| Snappy | Lossless | 1.5-2x | Very fast | Streaming data |
| Zstandard | Lossless | 2-3x | Fast | Modern applications |
| RLE | Lossless | Variable | Very fast | Simple repetitive data |
Ruby Zlib Constants
| Constant | Value | Description |
|---|---|---|
| NO_COMPRESSION | 0 | No compression |
| BEST_SPEED | 1 | Fast compression |
| DEFAULT_COMPRESSION | -1 | Default level (6) |
| BEST_COMPRESSION | 9 | Maximum compression |
| FILTERED | 1 | Filtered data strategy |
| HUFFMAN_ONLY | 2 | Huffman coding only |
| RLE | 3 | Run-length encoding |
| FIXED | 4 | Fixed Huffman codes |
Compression Level Effects
| Level | Ratio Impact | Speed Impact | Typical Use |
|---|---|---|---|
| 0 | No compression | Fastest | Testing, already compressed |
| 1-3 | Low compression | Fast | Real-time streaming |
| 4-6 | Moderate compression | Balanced | General purpose |
| 7-9 | High compression | Slow | Archives, storage |
Common File Format Headers
| Format | Magic Bytes | Extension | Description |
|---|---|---|---|
| GZIP | 1f 8b | .gz | DEFLATE compression |
| ZIP | 50 4b 03 04 | .zip | Archive with DEFLATE |
| BZIP2 | 42 5a 68 | .bz2 | Burrows-Wheeler compression |
| XZ | fd 37 7a 58 5a 00 | .xz | LZMA2 compression |
| ZSTD | 28 b5 2f fd | .zst | Zstandard compression |
Zlib Stream Classes
| Class | Purpose | Usage Pattern |
|---|---|---|
| Zlib::Deflate | Compression stream | Create, write data, finish |
| Zlib::Inflate | Decompression stream | Create, write compressed, finish |
| Zlib::GzipWriter | GZIP file writer | Open, write, close |
| Zlib::GzipReader | GZIP file reader | Open, read, close |
| Zlib::ZStream | Base stream class | Not instantiated directly |
Memory Requirements by Algorithm
| Algorithm | Window Size | Dictionary Size | Typical Memory |
|---|---|---|---|
| DEFLATE | 32 KB | 32 KB | 256-384 KB |
| LZ4 | 64 KB | None | 64 KB |
| BZIP2 | 900 KB | None | 3-4 MB |
| LZMA | 1-16 MB | 1-16 MB | 2-32 MB |
Error Handling Patterns
| Error Type | Detection | Recovery Strategy |
|---|---|---|
| Corrupt data | Checksum mismatch | Reject decompression |
| Truncated stream | Premature EOF | Request retransmission |
| Memory exhaustion | Allocation failure | Reduce buffer size |
| Invalid format | Magic byte check | Identify correct format |
| Version mismatch | Header version | Update library or convert |
Performance Optimization Checklist
| Optimization | Impact | Implementation |
|---|---|---|
| Choose appropriate level | High | Match level to use case |
| Use streaming API | High | Process large files incrementally |
| Enable parallel compression | High | Split into independent blocks |
| Reuse compressor objects | Medium | Avoid repeated initialization |
| Tune buffer sizes | Medium | Align with cache lines |
| Pre-allocate buffers | Low | Reduce allocation overhead |
Common Operations Quick Reference
# Compress string
Zlib::Deflate.deflate(string, level)
# Decompress string
Zlib::Inflate.inflate(compressed)
# Compress file
Zlib::GzipWriter.open(path) { |gz| gz.write(data) }
# Decompress file
Zlib::GzipReader.open(path) { |gz| gz.read }
# Calculate checksum
Zlib.crc32(data)
Zlib.adler32(data)
# Streaming compression
deflater = Zlib::Deflate.new(level)
deflater << chunk
compressed = deflater.finish