Overview
Compression transforms data into a representation that requires fewer bits than the original. The process identifies and eliminates statistical redundancy in data, producing output that occupies less space while preserving the information needed to reconstruct the original.
Software systems apply compression throughout the data lifecycle. Applications compress files before writing to disk, web servers compress responses before transmission, databases compress stored records, backup systems compress archives, and streaming services compress media content. The technique trades computational resources for reduced storage and bandwidth requirements.
Two fundamental compression approaches exist: lossless and lossy. Lossless compression preserves every bit of original data, allowing perfect reconstruction. File formats like ZIP, gzip, and PNG use lossless compression. Lossy compression discards information that humans perceive as less important, achieving higher compression ratios at the cost of fidelity. JPEG images and MP3 audio use lossy compression.
The effectiveness of compression depends on the data's inherent redundancy. Text files compress well because natural language contains predictable patterns. Already-compressed data resists further compression because redundancy has been removed. Random data compresses poorly because no patterns exist to exploit.
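The difference is easy to demonstrate: compressing repetitive text and random bytes of the same size produces very different results (exact sizes vary slightly by zlib version):

```ruby
require 'zlib'
require 'securerandom'

text = 'the quick brown fox ' * 200          # 4000 bytes of repetitive text
random = SecureRandom.random_bytes(4000)     # 4000 bytes of high-entropy data

text_ratio = text.bytesize.to_f / Zlib::Deflate.deflate(text).bytesize
random_ratio = random.bytesize.to_f / Zlib::Deflate.deflate(random).bytesize

# Repetitive text compresses heavily; random data slightly expands
# because the compressed stream still carries headers and checksums.
puts format('text: %.1f:1, random: %.2f:1', text_ratio, random_ratio)
```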
Compression operates at different layers within systems. File system compression operates transparently below application level. Application-level compression gives developers explicit control over what gets compressed and when. Protocol-level compression, such as HTTP content encoding, happens during transmission. Database compression reduces storage footprint while maintaining query performance.
Key Principles
Compression algorithms exploit patterns in data to represent information with fewer bits. The fundamental principle involves identifying redundancy and replacing repetitive patterns with shorter representations.
Entropy and Information Theory
Information entropy measures the minimum number of bits needed to represent data. Claude Shannon's information theory defines entropy as the average information content per symbol. Data with high redundancy has low entropy and compresses well. Random data has high entropy and approaches incompressibility.
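As a sketch, per-byte entropy can be computed directly from Shannon's formula; this illustrative `entropy` helper is not part of any library:

```ruby
# H = -sum p(x) * log2 p(x), measured here in bits per byte
def entropy(data)
  total = data.bytesize.to_f
  data.each_byte.tally.values.sum { |count| p = count / total; -p * Math.log2(p) }
end

puts entropy('aaaaaaaa')   # => 0.0 (one symbol, fully predictable)
puts entropy('abababab')   # => 1.0 (two equally likely symbols)
```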
The compression ratio quantifies effectiveness:
compression_ratio = uncompressed_size / compressed_size
space_savings = (1 - compressed_size / uncompressed_size) × 100%
A 100KB file compressed to 25KB has a compression ratio of 4:1 and 75% space savings.
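The arithmetic for that example:

```ruby
uncompressed = 100 * 1024
compressed = 25 * 1024

ratio = uncompressed.to_f / compressed                  # => 4.0
savings = (1 - compressed.to_f / uncompressed) * 100    # => 75.0 (percent)
```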
Lossless Compression Methods
Dictionary-based compression builds a dictionary of repeated sequences and replaces occurrences with shorter references. The LZ77 and LZ78 algorithms form the foundation of this family; DEFLATE, used by gzip and ZIP, builds on LZ77. These algorithms maintain a sliding window over recent data and encode new data as references to previous occurrences.
Statistical compression assigns shorter codes to frequently occurring symbols and longer codes to rare symbols. Huffman coding creates an optimal prefix-free code tree based on symbol frequencies. Arithmetic coding achieves better compression by encoding entire messages as single numbers within ranges.
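The tree construction can be sketched in a few lines. This toy `huffman_codes` helper (an illustration, not a library API) repeatedly merges the two lightest nodes until one tree remains:

```ruby
# Build a prefix-free code from a symbol => frequency hash.
def huffman_codes(frequencies)
  nodes = frequencies.map { |symbol, weight| [weight, symbol] }
  while nodes.size > 1
    nodes.sort_by!(&:first)
    a, b = nodes.shift(2)
    nodes << [a[0] + b[0], [a, b]]    # internal node holds its two children
  end
  codes = {}
  walk = lambda do |(_, payload), prefix|
    if payload.is_a?(Array)           # internal node: recurse into children
      walk.call(payload[0], prefix + '0')
      walk.call(payload[1], prefix + '1')
    else                              # leaf: record the accumulated bit string
      codes[payload] = prefix
    end
  end
  walk.call(nodes.first, '')
  codes
end

codes = huffman_codes('e' => 45, 't' => 13, 'a' => 12, 'q' => 1)
# The most frequent symbol ('e') receives the shortest code,
# the rarest ('q') the longest.
```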
Run-length encoding compresses sequences of identical values by storing the value and count. The sequence "AAAABBBBCCCC" becomes "4A4B4C". This method works well for data with long runs of repeated values, such as simple graphics.
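Run-length encoding is simple enough to sketch directly; the decoder below assumes runs of non-digit characters:

```ruby
def rle_encode(str)
  str.chars.chunk_while { |a, b| a == b }
     .map { |run| "#{run.size}#{run.first}" }
     .join
end

def rle_decode(str)
  str.scan(/(\d+)(\D)/).map { |count, char| char * count.to_i }.join
end

puts rle_encode('AAAABBBBCCCC')   # => "4A4B4C"
puts rle_decode('4A4B4C')         # => "AAAABBBBCCCC"
```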
Lossy Compression Methods
Lossy compression discards information based on perceptual models. JPEG compression transforms image data into the frequency domain using the Discrete Cosine Transform, then quantizes the high-frequency components that human vision perceives less accurately. The quantization step permanently discards information.
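The quantization step can be illustrated with plain numbers; this is a simplification of what JPEG does to DCT coefficients:

```ruby
# Divide by a step size and round: small values collapse to zero and
# the discarded precision cannot be recovered.
def quantize(coefficients, step)
  coefficients.map { |c| (c.to_f / step).round * step }
end

original  = [103, 97, 12, -5, 3, 1]
quantized = quantize(original, 10)
# => [100, 100, 10, -10, 0, 0] -- the original values are lost
```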
Audio compression removes frequencies outside human hearing range and applies psychoacoustic models to reduce data in ways that minimize perceived quality loss. Video compression combines spatial compression within frames and temporal compression across frames, removing redundancy between sequential images.
Compression Levels
Most compression algorithms offer configurable levels trading speed for compression ratio. Lower levels compress quickly with moderate ratios. Higher levels compress slowly but achieve better ratios. The relationship between level and ratio follows diminishing returns—level 9 might take twice as long as level 6 while achieving only 5% better compression.
Streaming and Block Compression
Streaming compression processes data sequentially without requiring the entire input in memory. This enables compression of large files and real-time data streams. Block compression divides data into chunks, compressing each independently. Block boundaries enable random access to compressed data and parallel compression, but reduce compression ratio compared to streaming because patterns cannot span blocks.
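Block compression is straightforward to sketch with Zlib: compress fixed-size chunks independently, then inflate only the chunk you need:

```ruby
require 'zlib'

data = 'abcdefgh' * 4096                  # 32KB sample payload
block_size = 8192
blocks = (0...data.bytesize).step(block_size).map do |offset|
  data.byteslice(offset, block_size)
end

# Each block compresses independently of the others
compressed_blocks = blocks.map { |block| Zlib::Deflate.deflate(block) }

# Random access: decompress only the third block
third = Zlib::Inflate.inflate(compressed_blocks[2])
third == blocks[2]   # => true
```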
Ruby Implementation
Ruby provides built-in compression through the Zlib module, which wraps the zlib C library implementing DEFLATE compression. The standard library includes support for gzip format, and gems extend functionality for additional formats.
Basic Compression with Zlib
The Zlib::Deflate class compresses data using the DEFLATE algorithm:
require 'zlib'
# Compress a string
original = "This text contains repeated words and repeated patterns that compress well. " * 10
compressed = Zlib::Deflate.deflate(original)
puts "Original size: #{original.bytesize} bytes"
puts "Compressed size: #{compressed.bytesize} bytes"
puts "Compression ratio: #{'%.2f' % (original.bytesize.to_f / compressed.bytesize)}"
# => Original size: 760 bytes
# The compressed size and ratio vary slightly by zlib version;
# expect roughly a tenth of the original size for input this repetitive.
Decompression reverses the process:
require 'zlib'
compressed = Zlib::Deflate.deflate("Original data")
decompressed = Zlib::Inflate.inflate(compressed)
puts decompressed
# => "Original data"
Compression Levels
Zlib supports ten compression levels, from 0 (no compression) to 9 (maximum compression). Level 6 is the default:
require 'zlib'
data = File.read('large_file.txt')
# Fast compression (level 1)
fast = Zlib::Deflate.deflate(data, Zlib::BEST_SPEED)
# Balanced compression (level 6, default)
balanced = Zlib::Deflate.deflate(data)
# Maximum compression (level 9)
maximum = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
puts "Fast: #{fast.bytesize} bytes"
puts "Balanced: #{balanced.bytesize} bytes"
puts "Maximum: #{maximum.bytesize} bytes"
Gzip Format
Gzip adds headers and checksums to DEFLATE compressed data. Ruby provides Zlib::GzipWriter and Zlib::GzipReader for gzip format:
require 'zlib'
# Compress to gzip format
Zlib::GzipWriter.open('output.gz') do |gz|
gz.write("Data to compress")
gz.write("More data")
end
# Decompress from gzip format
content = Zlib::GzipReader.open('output.gz') do |gz|
gz.read
end
puts content
# => "Data to compressMore data"
Streaming Compression
For large files, streaming avoids loading entire contents into memory:
require 'zlib'
# Compress file in streaming mode
File.open('large_input.txt', 'rb') do |input|
Zlib::GzipWriter.open('output.gz') do |gz|
while chunk = input.read(8192)
gz.write(chunk)
end
end
end
# Decompress file in streaming mode
File.open('decompressed.txt', 'wb') do |output|
Zlib::GzipReader.open('output.gz') do |gz|
while chunk = gz.read(8192)
output.write(chunk)
end
end
end
In-Memory Compression with StringIO
StringIO provides in-memory streams for compression without file I/O:
require 'zlib'
require 'stringio'
# Compress to memory
compressed = StringIO.new
Zlib::GzipWriter.wrap(compressed) do |gz|
gz.write("Data to compress")
end
compressed_data = compressed.string
# Decompress from memory
decompressed = StringIO.new(compressed_data)
Zlib::GzipReader.wrap(decompressed) do |gz|
puts gz.read
end
# => "Data to compress"
HTTP Content Encoding
Web applications compress HTTP responses transparently:
require 'zlib'
require 'rack'
require 'stringio'
class CompressionMiddleware
def initialize(app)
@app = app
end
def call(env)
status, headers, body = @app.call(env)
accepts_gzip = env['HTTP_ACCEPT_ENCODING']&.include?('gzip')
if accepts_gzip && !headers['Content-Encoding']
compressed_body = compress_body(body)
headers['Content-Encoding'] = 'gzip'
headers['Content-Length'] = compressed_body.bytesize.to_s
[status, headers, [compressed_body]]
else
[status, headers, body]
end
end
private
def compress_body(body)
io = StringIO.new
gz = Zlib::GzipWriter.new(io)
body.each { |part| gz.write(part) }
body.close if body.respond_to?(:close) # Rack bodies must be closed after iteration
gz.close
io.string
end
end
Implementation Approaches
Different compression algorithms offer distinct characteristics for various use cases. Selecting an appropriate algorithm requires understanding the trade-offs between compression ratio, speed, memory usage, and data access patterns.
DEFLATE Algorithm
DEFLATE combines LZ77 dictionary compression with Huffman coding. The algorithm maintains a 32KB sliding window of recent data and searches for matching sequences. When a match is found, it outputs a length-distance pair instead of the literal bytes. Huffman coding then compresses the length-distance pairs and literals.
DEFLATE balances compression ratio and speed effectively, making it the foundation for ZIP, gzip, and PNG formats. The algorithm performs well on text and structured data with repeated patterns. Compression ratio depends on redundancy in the input—highly repetitive data compresses more than random data.
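The 32KB window is observable from the outside: a repeated block compresses well while its first occurrence is still inside the window, and not at all once it has slid out. A sketch (the exact sizes vary by zlib version, so the comparisons are deliberately loose):

```ruby
require 'zlib'
require 'securerandom'

unit = SecureRandom.random_bytes(1024)   # incompressible 1KB block

near = unit + unit                                        # repeat 1KB back
far  = unit + SecureRandom.random_bytes(40_000) + unit    # repeat ~41KB back

near_size = Zlib::Deflate.deflate(near).bytesize
far_size  = Zlib::Deflate.deflate(far).bytesize

# near: the second copy is encoded as a handful of length-distance pairs
# far: the first copy has left the 32KB window, so the repeat is stored again
```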
LZ4 Compression
LZ4 prioritizes decompression speed over compression ratio. The algorithm uses a simpler matching scheme than DEFLATE, producing larger compressed output but decompressing significantly faster. LZ4 suits real-time applications where decompression speed matters more than storage space.
Applications use LZ4 for network protocols, inter-process communication, and scenarios requiring fast random access to compressed data. Decompression runs at multiple gigabytes per second per core, approaching memory bandwidth on modern hardware.
Brotli Compression
Brotli, developed by Google, achieves higher compression ratios than gzip for web content. The algorithm uses a larger dictionary (up to 16MB) and more sophisticated matching. Brotli compresses 15-25% better than gzip on typical web content.
Modern browsers support Brotli for HTTP content encoding. The algorithm requires more CPU for compression but decompresses at comparable speeds to gzip. Static content benefits from pre-compression at high levels, while dynamic content uses moderate levels for acceptable compression speed.
Zstandard
Zstandard (zstd) provides configurable trade-offs across a wide range of compression levels. Lower levels compress nearly as fast as LZ4 while achieving better ratios. Higher levels approach Brotli compression ratios while compressing faster. The algorithm supports dictionary compression for small messages, improving compression of similar small payloads.
Specialized Compression
Certain data types benefit from specialized compression. Columnar compression in databases groups similar data values together, improving compression ratio. Delta encoding stores differences between sequential values, working well for time-series data where consecutive values change incrementally. Bitmap compression applies run-length encoding to sparse bitmaps.
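Delta encoding is simple enough to sketch directly; the small deltas then compress far better than the raw values would:

```ruby
# Store the first value, then differences between consecutive values.
def delta_encode(values)
  prev = 0
  values.map { |v| d = v - prev; prev = v; d }
end

def delta_decode(deltas)
  sum = 0
  deltas.map { |d| sum += d }
end

readings = [1000, 1002, 1003, 1003, 1005]
deltas = delta_encode(readings)     # => [1000, 2, 1, 0, 2]
delta_decode(deltas) == readings    # => true
```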
Compression Format Selection
Format selection depends on requirements:
- gzip/DEFLATE: Universal compatibility, balanced ratio and speed, wide library support
- Brotli: Best compression for web content, slower compression, limited to HTTPS
- LZ4: Fastest decompression, moderate compression, good for transient data
- Zstandard: Flexible levels, good compression and speed, modern format
- XZ/LZMA: Highest compression ratios, very slow compression, archival use
Performance Considerations
Compression introduces computational overhead in exchange for reduced data size. Understanding performance characteristics guides decisions about when and how to apply compression.
CPU and Memory Trade-offs
Compression consumes CPU cycles during compression and decompression. The cost varies by algorithm and level. DEFLATE at level 1 compresses 5-10x faster than level 9 but achieves 10-20% lower compression ratio. High compression levels increase memory usage, with some algorithms requiring tens of megabytes for compression state.
Decompression generally requires less CPU than compression. DEFLATE decompression operates at 300-500 MB/s on typical hardware regardless of compression level. This asymmetry makes high compression levels viable when data is compressed once but decompressed many times, such as software distributions and static assets.
Storage and Bandwidth Savings
Compression reduces I/O operations and storage costs. Reading 1MB from disk at 100 MB/s takes 10ms. Reading 250KB compressed data and decompressing at 400 MB/s takes 2.5ms + 0.6ms = 3.1ms, a 3x speedup. The crossover point depends on I/O speed and compression ratio.
Network transmission benefits more dramatically. Transmitting 1MB over a 10 Mbps connection takes 800ms. Transmitting 250KB compressed takes 200ms plus compression/decompression time, typically 5-10ms each. Compression reduces latency by 75% in this scenario.
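The arithmetic behind both examples:

```ruby
# Disk: 1MB raw at 100 MB/s vs 250KB compressed plus decompression at 400 MB/s
raw_read_ms = 1.0 / 100 * 1000                                # 10.0 ms
compressed_read_ms = 0.25 / 100 * 1000 + 0.25 / 400 * 1000    # 2.5 + 0.625 ms

# Network: 1MB vs 250KB over a 10 Mbps link (8 bits per byte)
raw_network_ms = 1.0 * 8 / 10 * 1000          # 800.0 ms
compressed_network_ms = 0.25 * 8 / 10 * 1000  # 200.0 ms
```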
Compression Level Selection
Profile compression and decompression times for representative data:
require 'zlib'
require 'benchmark'
data = File.read('sample_data.txt')
Benchmark.bmbm do |x|
(1..9).each do |level|
x.report("level #{level}") do
1000.times { Zlib::Deflate.deflate(data, level) }
end
end
end
Typical results show level 1 compressing 5-10x faster than level 9, with level 6 offering a good balance. Level 9 might save a few percent over level 6 while taking 2-3x longer.
Adaptive Compression
Adapt compression based on runtime conditions:
require 'zlib'
class AdaptiveCompressor
FAST_LEVEL = 1
BALANCED_LEVEL = 6
HIGH_LEVEL = 9
def initialize
@cpu_threshold = 0.7
end
def compress(data)
level = select_level
Zlib::Deflate.deflate(data, level)
end
private
def select_level
cpu_usage = current_cpu_usage
if cpu_usage > @cpu_threshold
FAST_LEVEL
elsif cpu_usage > 0.5
BALANCED_LEVEL
else
HIGH_LEVEL
end
end
def current_cpu_usage
# Implementation depends on platform
# Read from /proc/stat on Linux or similar
0.5
end
end
Compression Threshold
Small data may not benefit from compression. Compression overhead and algorithm headers can exceed savings for small payloads:
require 'zlib'
def should_compress?(data, threshold: 100)
return false if data.bytesize < threshold
# Sample compression to estimate ratio
sample_size = [data.bytesize / 10, 1000].min
sample = data.byteslice(0, sample_size)
compressed_sample = Zlib::Deflate.deflate(sample, 1)
estimated_ratio = sample.bytesize.to_f / compressed_sample.bytesize
estimated_ratio > 1.1 # Require roughly 10% savings
end
def compress_if_beneficial(data)
if should_compress?(data)
Zlib::Deflate.deflate(data)
else
data
end
end
Parallel Compression
Divide large files into blocks for parallel compression:
require 'zlib'
require 'parallel'
def parallel_compress(input_file, output_file, block_size: 1024 * 1024)
blocks = []
File.open(input_file, 'rb') do |f|
while chunk = f.read(block_size)
blocks << chunk
end
end
compressed_blocks = Parallel.map(blocks) do |block|
Zlib::Deflate.deflate(block, 6)
end
File.open(output_file, 'wb') do |f|
compressed_blocks.each do |block|
f.write([block.bytesize].pack('N'))
f.write(block)
end
end
end
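A matching reader for the length-prefixed block format written by parallel_compress above inflates each block in order:

```ruby
require 'zlib'

# Reads blocks written as [4-byte big-endian size][deflated data] pairs.
def parallel_decompress(input_file, output_file)
  File.open(input_file, 'rb') do |f|
    File.open(output_file, 'wb') do |out|
      until f.eof?
        length = f.read(4).unpack1('N')   # 4-byte big-endian block size
        out.write(Zlib::Inflate.inflate(f.read(length)))
      end
    end
  end
end
```

Decompression of the blocks could be parallelized the same way; sequential inflation is shown here for simplicity.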
Practical Examples
Real-world scenarios demonstrate compression integration in applications.
Compressing Log Files
Applications generate large log files that compress efficiently:
require 'zlib'
require 'logger'
class CompressingLogger
def initialize(base_path, compress_after_bytes: 10 * 1024 * 1024)
@base_path = base_path
@compress_after_bytes = compress_after_bytes
@current_file = open_new_file
@logger = Logger.new(@current_file)
end
def info(message)
@logger.info(message)
rotate_if_needed
end
def error(message)
@logger.error(message)
rotate_if_needed
end
private
def open_new_file
timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
File.open("#{@base_path}/app_#{timestamp}.log", 'a')
end
def rotate_if_needed
return if @current_file.size < @compress_after_bytes
@logger.close
compress_current_file
@current_file = open_new_file
@logger = Logger.new(@current_file)
end
def compress_current_file
input_path = @current_file.path
output_path = "#{input_path}.gz"
Zlib::GzipWriter.open(output_path) do |gz|
File.open(input_path, 'rb') do |input|
while chunk = input.read(8192)
gz.write(chunk)
end
end
end
File.delete(input_path)
end
end
logger = CompressingLogger.new('/var/log/myapp')
logger.info("Application started")
API Response Compression
Compress API responses exceeding a size threshold:
require 'sinatra'
require 'json'
require 'zlib'
class CompressionHelper
MIN_SIZE = 1000
def self.compress_response?(request, content)
return false if content.bytesize < MIN_SIZE
return false unless request.env['HTTP_ACCEPT_ENCODING']
encodings = request.env['HTTP_ACCEPT_ENCODING'].split(',').map(&:strip)
encodings.include?('gzip')
end
def self.compress(content)
io = StringIO.new
gz = Zlib::GzipWriter.new(io, Zlib::BEST_SPEED)
gz.write(content)
gz.close
io.string
end
end
get '/api/data' do
data = {
records: Array.new(1000) { |i| { id: i, value: "Record #{i}" } }
}
json_content = JSON.generate(data)
if CompressionHelper.compress_response?(request, json_content)
compressed = CompressionHelper.compress(json_content)
headers 'Content-Encoding' => 'gzip',
'Content-Length' => compressed.bytesize.to_s,
'Content-Type' => 'application/json'
compressed
else
content_type :json
json_content
end
end
Database Column Compression
Compress large text columns before storage:
require 'zlib'
require 'active_record'
class Article < ActiveRecord::Base
before_save :compress_content
after_find :decompress_content
attr_accessor :content_uncompressed
def content
@content_uncompressed || decompress_stored_content
end
def content=(value)
@content_uncompressed = value
end
private
def compress_content
return unless @content_uncompressed
if @content_uncompressed.bytesize > 1000
self.content_compressed = true
self.content_data = Zlib::Deflate.deflate(@content_uncompressed)
else
self.content_compressed = false
self.content_data = @content_uncompressed
end
end
def decompress_content
return unless content_data
@content_uncompressed = decompress_stored_content
end
def decompress_stored_content
if content_compressed
Zlib::Inflate.inflate(content_data)
else
content_data
end
end
end
Caching Compressed Content
Cache compressed responses for repeated requests:
require 'zlib'
require 'digest'
class CompressedCache
def initialize
@cache = {}
end
def fetch(key, &block)
cache_key = cache_key_for(key)
if @cache.key?(cache_key)
decompress(@cache[cache_key])
else
content = block.call
@cache[cache_key] = compress(content)
content
end
end
private
def cache_key_for(key)
Digest::SHA256.hexdigest(key.to_s)
end
def compress(data)
Zlib::Deflate.deflate(data, Zlib::BEST_SPEED)
end
def decompress(data)
Zlib::Inflate.inflate(data)
end
end
cache = CompressedCache.new
response = cache.fetch('expensive_query') do
# Execute expensive operation
generate_large_report
end
Common Patterns
Several patterns establish standard practices for compression integration.
Transparent Compression
Implement compression transparently within infrastructure layers:
require 'zlib'
require 'json'
class TransparentStorage
def initialize(backend)
@backend = backend
end
def write(key, data)
compressed = Zlib::Deflate.deflate(data)
@backend.write(key, compressed)
@backend.write("#{key}.meta", { compressed: true }.to_json)
end
def read(key)
meta = JSON.parse(@backend.read("#{key}.meta"))
data = @backend.read(key)
if meta['compressed']
Zlib::Inflate.inflate(data)
else
data
end
end
end
storage = TransparentStorage.new(file_backend)
storage.write('document', large_document)
document = storage.read('document')
Conditional Compression by Content Type
Apply compression selectively based on content characteristics:
require 'zlib'
class ContentTypeCompressor
COMPRESSIBLE_TYPES = %w[
text/plain
text/html
text/css
text/javascript
application/json
application/xml
].freeze
INCOMPRESSIBLE_EXTENSIONS = %w[
.jpg .jpeg .png .gif .mp3 .mp4 .zip .gz
].freeze
def compress?(content_type, filename)
return false if already_compressed?(filename)
compressible_type?(content_type)
end
private
def already_compressed?(filename)
INCOMPRESSIBLE_EXTENSIONS.any? { |ext| filename.end_with?(ext) }
end
def compressible_type?(content_type)
COMPRESSIBLE_TYPES.any? { |type| content_type.start_with?(type) }
end
end
compressor = ContentTypeCompressor.new
if compressor.compress?('text/html', 'page.html')
compressed = Zlib::Deflate.deflate(content)
end
Progressive Compression
Compress data progressively as it streams:
require 'zlib'
class ProgressiveCompressor
def initialize(output)
@deflater = Zlib::Deflate.new
@output = output
end
def write(chunk)
compressed = @deflater.deflate(chunk)
@output.write(compressed) unless compressed.empty?
end
def finish
final = @deflater.finish
@output.write(final)
@deflater.close
end
end
File.open('output.deflate', 'wb') do |output|
compressor = ProgressiveCompressor.new(output)
File.open('large_input.txt', 'rb') do |input|
while chunk = input.read(8192)
compressor.write(chunk)
end
end
compressor.finish
end
Compression with Error Recovery
Handle compression errors gracefully:
require 'zlib'
class SafeCompressor
class CompressionError < StandardError; end
def compress(data, fallback: :original)
compressed = Zlib::Deflate.deflate(data)
# Verify compression worked
decompressed = Zlib::Inflate.inflate(compressed)
raise CompressionError unless decompressed == data
compressed
rescue Zlib::Error, CompressionError => e
handle_error(e, data, fallback)
end
private
def handle_error(error, data, fallback)
case fallback
when :original
data
when :raise
raise error
when Proc
fallback.call(data, error)
end
end
end
compressor = SafeCompressor.new
result = compressor.compress(data, fallback: :original)
Reference
Ruby Compression Methods
| Class/Module | Method | Description |
|---|---|---|
| Zlib::Deflate | deflate(string, level) | Compress string with specified level |
| Zlib::Inflate | inflate(string) | Decompress deflated string |
| Zlib::GzipWriter | new(io, level) | Create gzip writer for output stream |
| Zlib::GzipWriter | write(data) | Write data to compressed stream |
| Zlib::GzipReader | new(io) | Create gzip reader for input stream |
| Zlib::GzipReader | read(length) | Read decompressed data from stream |
| Zlib::GzipWriter | open(filename, level) | Open gzip file for writing |
| Zlib::GzipReader | open(filename) | Open gzip file for reading |
Compression Levels
| Level | Constant | Speed | Ratio | Use Case |
|---|---|---|---|---|
| 0 | NO_COMPRESSION | Fastest | None | No compression needed |
| 1 | BEST_SPEED | Very fast | Low | Real-time compression |
| 2-5 | - | Fast | Moderate | Network transmission |
| 6 | DEFAULT_COMPRESSION (-1) | Balanced | Good | General purpose |
| 7-8 | - | Slow | Better | Storage optimization |
| 9 | BEST_COMPRESSION | Slowest | Best | Archival storage |
Algorithm Comparison
| Algorithm | Compression Ratio | Compression Speed | Decompression Speed | Memory Usage | Use Case |
|---|---|---|---|---|---|
| DEFLATE/gzip | Good | Moderate | Fast | Low | General purpose |
| LZ4 | Moderate | Very fast | Very fast | Very low | Real-time processing |
| Brotli | Excellent | Slow | Fast | High | Static web content |
| Zstandard | Excellent | Fast | Fast | Moderate | Modern applications |
| LZMA/XZ | Excellent | Very slow | Moderate | Very high | Archival compression |
Compression Decision Matrix
| Data Size | Compression Frequency | Access Pattern | Recommended Approach |
|---|---|---|---|
| < 100 bytes | Any | Any | Skip compression |
| 100 bytes - 10 KB | Frequent | Sequential | Level 1 DEFLATE |
| 10 KB - 1 MB | Frequent | Sequential | Level 3-6 DEFLATE |
| 10 KB - 1 MB | Infrequent | Random | Block compression |
| > 1 MB | Frequent | Sequential | Level 6 DEFLATE |
| > 1 MB | Infrequent | Sequential | Level 9 DEFLATE |
| > 1 MB | Any | Random | Block compression |
Format Extensions
| Extension | Format | Algorithm | Header | Checksum |
|---|---|---|---|---|
| .gz | gzip | DEFLATE | Yes | CRC32 |
| .zip | ZIP | DEFLATE | Yes | CRC32 |
| .zst | Zstandard | Zstandard | Yes | XXH64 |
| .lz4 | LZ4 | LZ4 | Yes | XXH32 |
| .br | Brotli | Brotli | Minimal | None |
| .xz | XZ | LZMA2 | Yes | CRC64 |
Performance Benchmarks
Typical throughput on modern hardware (MB/s):
| Algorithm | Level | Compression | Decompression |
|---|---|---|---|
| DEFLATE | 1 | 200-300 | 300-500 |
| DEFLATE | 6 | 50-100 | 300-500 |
| DEFLATE | 9 | 20-30 | 300-500 |
| LZ4 | 1 | 400-600 | 2000-3000 |
| Brotli | 6 | 10-20 | 300-400 |
| Zstandard | 3 | 300-400 | 600-800 |
| Zstandard | 19 | 5-10 | 600-800 |