Overview
Compression transforms data into a representation that requires fewer bits than the original. The process identifies and eliminates statistical redundancy in data, producing output that occupies less space while preserving the information needed to reconstruct the original.
Software systems apply compression throughout the data lifecycle. Applications compress files before writing to disk, web servers compress responses before transmission, databases compress stored records, backup systems compress archives, and streaming services compress media content. The technique trades computational resources for reduced storage and bandwidth requirements.
Two fundamental compression approaches exist: lossless and lossy. Lossless compression preserves every bit of original data, allowing perfect reconstruction. File formats like ZIP, gzip, and PNG use lossless compression. Lossy compression discards information that humans perceive as less important, achieving higher compression ratios at the cost of fidelity. JPEG images and MP3 audio use lossy compression.
The effectiveness of compression depends on the data's inherent redundancy. Text files compress well because natural language contains predictable patterns. Already-compressed data resists further compression because redundancy has been removed. Random data compresses poorly because no patterns exist to exploit.
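The difference is easy to demonstrate: compressing repetitive text and random bytes of the same size produces very different results (exact sizes vary slightly by zlib version):

```ruby
require 'zlib'
require 'securerandom'

text = 'the quick brown fox ' * 200          # 4000 bytes of repetitive text
random = SecureRandom.random_bytes(4000)     # 4000 bytes of high-entropy data

text_ratio = text.bytesize.to_f / Zlib::Deflate.deflate(text).bytesize
random_ratio = random.bytesize.to_f / Zlib::Deflate.deflate(random).bytesize

# Repetitive text compresses heavily; random data slightly expands
# because the compressed stream still carries headers and checksums.
puts format('text: %.1f:1, random: %.2f:1', text_ratio, random_ratio)
```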
Compression operates at different layers within systems. File system compression operates transparently below application level. Application-level compression gives developers explicit control over what gets compressed and when. Protocol-level compression, such as HTTP content encoding, happens during transmission. Database compression reduces storage footprint while maintaining query performance.
Key Principles
Compression algorithms exploit patterns in data to represent information with fewer bits. The fundamental principle involves identifying redundancy and replacing repetitive patterns with shorter representations.
Entropy and Information Theory
Information entropy measures the minimum number of bits needed to represent data. Claude Shannon's information theory defines entropy as the average information content per symbol. Data with high redundancy has low entropy and compresses well. Random data has high entropy and approaches incompressibility.
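As a sketch, per-byte entropy can be computed directly from Shannon's formula; this illustrative `entropy` helper is not part of any library:

```ruby
# H = -sum p(x) * log2 p(x), measured here in bits per byte
def entropy(data)
  total = data.bytesize.to_f
  data.each_byte.tally.values.sum { |count| p = count / total; -p * Math.log2(p) }
end

puts entropy('aaaaaaaa')   # => 0.0 (one symbol, fully predictable)
puts entropy('abababab')   # => 1.0 (two equally likely symbols)
```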
The compression ratio quantifies effectiveness:
compression_ratio = uncompressed_size / compressed_size
space_savings = (1 - compressed_size / uncompressed_size) × 100%
A 100KB file compressed to 25KB has a compression ratio of 4:1 and 75% space savings.
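The arithmetic for that example:

```ruby
uncompressed = 100 * 1024
compressed = 25 * 1024

ratio = uncompressed.to_f / compressed                  # => 4.0
savings = (1 - compressed.to_f / uncompressed) * 100    # => 75.0 (percent)
```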
Lossless Compression Methods
Dictionary-based compression builds a dictionary of repeated sequences and replaces occurrences with shorter references. The LZ77 and LZ78 algorithms form the foundation of this family; DEFLATE, used by gzip and ZIP, builds on LZ77. These algorithms maintain a sliding window over recent data and encode new data as references to previous occurrences.
Statistical compression assigns shorter codes to frequently occurring symbols and longer codes to rare symbols. Huffman coding creates an optimal prefix-free code tree based on symbol frequencies. Arithmetic coding achieves better compression by encoding entire messages as single numbers within ranges.
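The tree construction can be sketched in a few lines. This toy `huffman_codes` helper (an illustration, not a library API) repeatedly merges the two lightest nodes until one tree remains:

```ruby
# Build a prefix-free code from a symbol => frequency hash.
def huffman_codes(frequencies)
  nodes = frequencies.map { |symbol, weight| [weight, symbol] }
  while nodes.size > 1
    nodes.sort_by!(&:first)
    a, b = nodes.shift(2)
    nodes << [a[0] + b[0], [a, b]]    # internal node holds its two children
  end
  codes = {}
  walk = lambda do |(_, payload), prefix|
    if payload.is_a?(Array)           # internal node: recurse into children
      walk.call(payload[0], prefix + '0')
      walk.call(payload[1], prefix + '1')
    else                              # leaf: record the accumulated bit string
      codes[payload] = prefix
    end
  end
  walk.call(nodes.first, '')
  codes
end

codes = huffman_codes('e' => 45, 't' => 13, 'a' => 12, 'q' => 1)
# The most frequent symbol ('e') receives the shortest code,
# the rarest ('q') the longest.
```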
Run-length encoding compresses sequences of identical values by storing the value and count. The sequence "AAAABBBBCCCC" becomes "4A4B4C". This method works well for data with long runs of repeated values, such as simple graphics.
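Run-length encoding is simple enough to sketch directly; the decoder below assumes runs of non-digit characters:

```ruby
def rle_encode(str)
  str.chars.chunk_while { |a, b| a == b }
     .map { |run| "#{run.size}#{run.first}" }
     .join
end

def rle_decode(str)
  str.scan(/(\d+)(\D)/).map { |count, char| char * count.to_i }.join
end

puts rle_encode('AAAABBBBCCCC')   # => "4A4B4C"
puts rle_decode('4A4B4C')         # => "AAAABBBBCCCC"
```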
Lossy Compression Methods
Lossy compression discards information based on perceptual models. JPEG compression transforms image data into the frequency domain using the Discrete Cosine Transform, then quantizes the high-frequency components that human vision perceives less accurately. The quantization step permanently discards information.
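The quantization step can be illustrated with plain numbers; this is a simplification of what JPEG does to DCT coefficients:

```ruby
# Divide by a step size and round: small values collapse to zero and
# the discarded precision cannot be recovered.
def quantize(coefficients, step)
  coefficients.map { |c| (c.to_f / step).round * step }
end

original  = [103, 97, 12, -5, 3, 1]
quantized = quantize(original, 10)
# => [100, 100, 10, -10, 0, 0] -- the original values are lost
```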
Audio compression removes frequencies outside human hearing range and applies psychoacoustic models to reduce data in ways that minimize perceived quality loss. Video compression combines spatial compression within frames and temporal compression across frames, removing redundancy between sequential images.
Compression Levels
Most compression algorithms offer configurable levels trading speed for compression ratio. Lower levels compress quickly with moderate ratios. Higher levels compress slowly but achieve better ratios. The relationship between level and ratio follows diminishing returns—level 9 might take twice as long as level 6 while achieving only 5% better compression.
Streaming and Block Compression
Streaming compression processes data sequentially without requiring the entire input in memory. This enables compression of large files and real-time data streams. Block compression divides data into chunks, compressing each independently. Block boundaries enable random access to compressed data and parallel compression, but reduce compression ratio compared to streaming because patterns cannot span blocks.
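Block compression is straightforward to sketch with Zlib: compress fixed-size chunks independently, then inflate only the chunk you need:

```ruby
require 'zlib'

data = 'abcdefgh' * 4096                  # 32KB sample payload
block_size = 8192
blocks = (0...data.bytesize).step(block_size).map do |offset|
  data.byteslice(offset, block_size)
end

# Each block compresses independently of the others
compressed_blocks = blocks.map { |block| Zlib::Deflate.deflate(block) }

# Random access: decompress only the third block
third = Zlib::Inflate.inflate(compressed_blocks[2])
third == blocks[2]   # => true
```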
Ruby Implementation
Ruby provides built-in compression through the Zlib module, which wraps the zlib C library implementing DEFLATE compression. The standard library includes support for gzip format, and gems extend functionality for additional formats.
Basic Compression with Zlib
The Zlib::Deflate class compresses data using the DEFLATE algorithm:
require 'zlib'
# Compress a string
original = "This text contains repeated words and repeated patterns that compress well. " * 10
compressed = Zlib::Deflate.deflate(original)
puts "Original size: #{original.bytesize} bytes"
puts "Compressed size: #{compressed.bytesize} bytes"
puts "Compression ratio: #{'%.2f' % (original.bytesize.to_f / compressed.bytesize)}"
# => Original size: 760 bytes
# The compressed size and ratio vary slightly by zlib version;
# expect roughly a tenth of the original size for input this repetitive.
Decompression reverses the process:
require 'zlib'
compressed = Zlib::Deflate.deflate("Original data")
decompressed = Zlib::Inflate.inflate(compressed)
puts decompressed
# => "Original data"
Compression Levels
Zlib supports ten compression levels, from 0 (no compression) to 9 (maximum compression). Level 6 is the default:
require 'zlib'
data = File.read('large_file.txt')
# Fast compression (level 1)
fast = Zlib::Deflate.deflate(data, Zlib::BEST_SPEED)
# Balanced compression (level 6, default)
balanced = Zlib::Deflate.deflate(data)
# Maximum compression (level 9)
maximum = Zlib::Deflate.deflate(data, Zlib::BEST_COMPRESSION)
puts "Fast: #{fast.bytesize} bytes"
puts "Balanced: #{balanced.bytesize} bytes"
puts "Maximum: #{maximum.bytesize} bytes"
Gzip Format
Gzip adds headers and checksums to DEFLATE compressed data. Ruby provides Zlib::GzipWriter and Zlib::GzipReader for gzip format:
require 'zlib'
# Compress to gzip format
Zlib::GzipWriter.open('output.gz') do |gz|
gz.write("Data to compress")
gz.write("More data")
end
# Decompress from gzip format
content = Zlib::GzipReader.open('output.gz') do |gz|
gz.read
end
puts content
# => "Data to compressMore data"
Streaming Compression
For large files, streaming avoids loading entire contents into memory:
require 'zlib'
# Compress file in streaming mode
File.open('large_input.txt', 'rb') do |input|
Zlib::GzipWriter.open('output.gz') do |gz|
while chunk = input.read(8192)
gz.write(chunk)
end
end
end
# Decompress file in streaming mode
File.open('decompressed.txt', 'wb') do |output|
Zlib::GzipReader.open('output.gz') do |gz|
while chunk = gz.read(8192)
output.write(chunk)
end
end
end
In-Memory Compression with StringIO
StringIO provides in-memory streams for compression without file I/O:
require 'zlib'
require 'stringio'
# Compress to memory
compressed = StringIO.new
Zlib::GzipWriter.wrap(compressed) do |gz|
gz.write("Data to compress")
end
compressed_data = compressed.string
# Decompress from memory
decompressed = StringIO.new(compressed_data)
Zlib::GzipReader.wrap(decompressed) do |gz|
puts gz.read
end
# => "Data to compress"
HTTP Content Encoding
Web applications compress HTTP responses transparently:
require 'zlib'
require 'rack'
require 'stringio'
class CompressionMiddleware
def initialize(app)
@app = app
end
def call(env)
status, headers, body = @app.call(env)
accepts_gzip = env['HTTP_ACCEPT_ENCODING']&.include?('gzip')
if accepts_gzip && !headers['Content-Encoding']
compressed_body = compress_body(body)
headers['Content-Encoding'] = 'gzip'
headers['Content-Length'] = compressed_body.bytesize.to_s
[status, headers, [compressed_body]]
else
[status, headers, body]
end
end
private
def compress_body(body)
io = StringIO.new
gz = Zlib::GzipWriter.new(io)
body.each { |part| gz.write(part) }
body.close if body.respond_to?(:close) # Rack bodies must be closed after iteration
gz.close
io.string
end
end
Implementation Approaches
Different compression algorithms offer distinct characteristics for various use cases. Selecting an appropriate algorithm requires understanding the trade-offs between compression ratio, speed, memory usage, and data access patterns.
DEFLATE Algorithm
DEFLATE combines LZ77 dictionary compression with Huffman coding. The algorithm maintains a 32KB sliding window of recent data and searches for matching sequences. When a match is found, it outputs a length-distance pair instead of the literal bytes. Huffman coding then compresses the length-distance pairs and literals.
DEFLATE balances compression ratio and speed effectively, making it the foundation for ZIP, gzip, and PNG formats. The algorithm performs well on text and structured data with repeated patterns. Compression ratio depends on redundancy in the input—highly repetitive data compresses more than random data.
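The 32KB window is observable from the outside: a repeated block compresses well while its first occurrence is still inside the window, and not at all once it has slid out. A sketch (the exact sizes vary by zlib version, so the comparisons are deliberately loose):

```ruby
require 'zlib'
require 'securerandom'

unit = SecureRandom.random_bytes(1024)   # incompressible 1KB block

near = unit + unit                                        # repeat 1KB back
far  = unit + SecureRandom.random_bytes(40_000) + unit    # repeat ~41KB back

near_size = Zlib::Deflate.deflate(near).bytesize
far_size  = Zlib::Deflate.deflate(far).bytesize

# near: the second copy is encoded as a handful of length-distance pairs
# far: the first copy has left the 32KB window, so the repeat is stored again
```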
LZ4 Compression
LZ4 prioritizes decompression speed over compression ratio. The algorithm uses a simpler matching scheme than DEFLATE, producing larger compressed output but decompressing significantly faster. LZ4 suits real-time applications where decompression speed matters more than storage space.
Applications use LZ4 for network protocols, inter-process communication, and scenarios requiring fast random access to compressed data. Decompression runs at multiple gigabytes per second per core, approaching memory bandwidth on modern hardware.
Brotli Compression
Brotli, developed by Google, achieves higher compression ratios than gzip for web content. The algorithm uses a larger dictionary (up to 16MB) and more sophisticated matching. Brotli compresses 15-25% better than gzip on typical web content.
Modern browsers support Brotli for HTTP content encoding. The algorithm requires more CPU for compression but decompresses at comparable speeds to gzip. Static content benefits from pre-compression at high levels, while dynamic content uses moderate levels for acceptable compression speed.
Zstandard
Zstandard (zstd) provides configurable trade-offs across a wide range of compression levels. Lower levels compress nearly as fast as LZ4 while achieving better ratios. Higher levels approach Brotli compression ratios while compressing faster. The algorithm supports dictionary compression for small messages, improving compression of similar small payloads.
Specialized Compression
Certain data types benefit from specialized compression. Columnar compression in databases groups similar data values together, improving compression ratio. Delta encoding stores differences between sequential values, working well for time-series data where consecutive values change incrementally. Bitmap compression applies run-length encoding to sparse bitmaps.
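Delta encoding is simple enough to sketch directly; the small deltas then compress far better than the raw values would:

```ruby
# Store the first value, then differences between consecutive values.
def delta_encode(values)
  prev = 0
  values.map { |v| d = v - prev; prev = v; d }
end

def delta_decode(deltas)
  sum = 0
  deltas.map { |d| sum += d }
end

readings = [1000, 1002, 1003, 1003, 1005]
deltas = delta_encode(readings)     # => [1000, 2, 1, 0, 2]
delta_decode(deltas) == readings    # => true
```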
Compression Format Selection
Format selection depends on requirements:
- gzip/DEFLATE: Universal compatibility, balanced ratio and speed, wide library support
- Brotli: Best compression for web content, slower compression, limited to HTTPS
- LZ4: Fastest decompression, moderate compression, good for transient data
- Zstandard: Flexible levels, good compression and speed, modern format
- XZ/LZMA: Highest compression ratios, very slow compression, archival use
Performance Considerations
Compression introduces computational overhead in exchange for reduced data size. Understanding performance characteristics guides decisions about when and how to apply compression.
CPU and Memory Trade-offs
Compression consumes CPU cycles during compression and decompression. The cost varies by algorithm and level. DEFLATE at level 1 compresses 5-10x faster than level 9 but achieves 10-20% lower compression ratio. High compression levels increase memory usage, with some algorithms requiring tens of megabytes for compression state.
Decompression generally requires less CPU than compression. DEFLATE decompression operates at 300-500 MB/s on typical hardware regardless of compression level. This asymmetry makes high compression levels viable when data is compressed once but decompressed many times, such as software distributions and static assets.
Storage and Bandwidth Savings
Compression reduces I/O operations and storage costs. Reading 1MB from disk at 100 MB/s takes 10ms. Reading 250KB compressed data and decompressing at 400 MB/s takes 2.5ms + 0.6ms = 3.1ms, a 3x speedup. The crossover point depends on I/O speed and compression ratio.
Network transmission benefits more dramatically. Transmitting 1MB over a 10 Mbps connection takes 800ms. Transmitting 250KB compressed takes 200ms plus compression/decompression time, typically 5-10ms each. Compression reduces latency by 75% in this scenario.
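The arithmetic behind both examples:

```ruby
# Disk: 1MB raw at 100 MB/s vs 250KB compressed plus decompression at 400 MB/s
raw_read_ms = 1.0 / 100 * 1000                                # 10.0 ms
compressed_read_ms = 0.25 / 100 * 1000 + 0.25 / 400 * 1000    # 2.5 + 0.625 ms

# Network: 1MB vs 250KB over a 10 Mbps link (8 bits per byte)
raw_network_ms = 1.0 * 8 / 10 * 1000          # 800.0 ms
compressed_network_ms = 0.25 * 8 / 10 * 1000  # 200.0 ms
```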
Compression Level Selection
Profile compression and decompression times for representative data:
require 'zlib'
require 'benchmark'
data = File.read('sample_data.txt')
Benchmark.bmbm do |x|
(1..9).each do |level|
x.report("level #{level}") do
1000.times { Zlib::Deflate.deflate(data, level) }
end
end
end
Typical results show level 1 compressing 5-10x faster than level 9, with level 6 offering a good balance. Level 9 might save a few percent over level 6 while taking 2-3x longer.
Adaptive Compression
Adapt compression based on runtime conditions:
require 'zlib'
class AdaptiveCompressor
FAST_LEVEL = 1
BALANCED_LEVEL = 6
HIGH_LEVEL = 9
def initialize
@cpu_threshold = 0.7
end
def compress(data)
level = select_level
Zlib::Deflate.deflate(data, level)
end
private
def select_level
cpu_usage = current_cpu_usage
if cpu_usage > @cpu_threshold
FAST_LEVEL
elsif cpu_usage > 0.5
BALANCED_LEVEL
else
HIGH_LEVEL
end
end
def current_cpu_usage
# Implementation depends on platform
# Read from /proc/stat on Linux or similar
0.5
end
end
Compression Threshold
Small data may not benefit from compression. Compression overhead and algorithm headers can exceed savings for small payloads:
require 'zlib'
def should_compress?(data, threshold: 100)
return false if data.bytesize < threshold
# Sample compression to estimate ratio
sample_size = [data.bytesize / 10, 1000].min
sample = data.byteslice(0, sample_size)
compressed_sample = Zlib::Deflate.deflate(sample, 1)
estimated_ratio = sample.bytesize.to_f / compressed_sample.bytesize
estimated_ratio > 1.1 # Require roughly 10% savings
end
def compress_if_beneficial(data)
if should_compress?(data)
Zlib::Deflate.deflate(data)
else
data
end
end
Parallel Compression
Divide large files into blocks for parallel compression:
require 'zlib'
require 'parallel'
def parallel_compress(input_file, output_file, block_size: 1024 * 1024)
blocks = []
File.open(input_file, 'rb') do |f|
while chunk = f.read(block_size)
blocks << chunk
end
end
compressed_blocks = Parallel.map(blocks) do |block|
Zlib::Deflate.deflate(block, 6)
end
File.open(output_file, 'wb') do |f|
compressed_blocks.each do |block|
f.write([block.bytesize].pack('N'))
f.write(block)
end
end
end
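A matching reader for the length-prefixed block format written by parallel_compress above inflates each block in order:

```ruby
require 'zlib'

# Reads blocks written as [4-byte big-endian size][deflated data] pairs.
def parallel_decompress(input_file, output_file)
  File.open(input_file, 'rb') do |f|
    File.open(output_file, 'wb') do |out|
      until f.eof?
        length = f.read(4).unpack1('N')   # 4-byte big-endian block size
        out.write(Zlib::Inflate.inflate(f.read(length)))
      end
    end
  end
end
```

Decompression of the blocks could be parallelized the same way; sequential inflation is shown here for simplicity.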
Practical Examples
Real-world scenarios demonstrate compression integration in applications.
Compressing Log Files
Applications generate large log files that compress efficiently:
require 'zlib'
require 'logger'
class CompressingLogger
def initialize(base_path, compress_after_bytes: 10 * 1024 * 1024)
@base_path = base_path
@compress_after_bytes = compress_after_bytes
@current_file = open_new_file
@logger = Logger.new(@current_file)
end
def info(message)
@logger.info(message)
rotate_if_needed
end
def error(message)
@logger.error(message)
rotate_if_needed
end
private
def open_new_file
timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
File.open("#{@base_path}/app_#{timestamp}.log", 'a')
end
def rotate_if_needed
return if @current_file.size < @compress_after_bytes
@logger.close
compress_current_file
@current_file = open_new_file
@logger = Logger.new(@current_file)
end
def compress_current_file
input_path = @current_file.path
output_path = "#{input_path}.gz"
Zlib::GzipWriter.open(output_path) do |gz|
File.open(input_path, 'rb') do |input|
while chunk = input.read(8192)
gz.write(chunk)
end
end
end
File.delete(input_path)
end
end
logger = CompressingLogger.new('/var/log/myapp')
logger.info("Application started")
API Response Compression
Compress API responses exceeding a size threshold:
require 'sinatra'
require 'json'
require 'zlib'
class CompressionHelper
MIN_SIZE = 1000
def self.compress_response?(request, content)
return false if content.bytesize < MIN_SIZE
return false unless request.env['HTTP_ACCEPT_ENCODING']
encodings = request.env['HTTP_ACCEPT_ENCODING'].split(',').map(&:strip)
encodings.include?('gzip')
end
def self.compress(content)
io = StringIO.new
gz = Zlib::GzipWriter.new(io, Zlib::BEST_SPEED)
gz.write(content)
gz.close
io.string
end
end
get '/api/data' do
data = {
records: Array.new(1000) { |i| { id: i, value: "Record #{i}" } }
}
json_content = JSON.generate(data)
if CompressionHelper.compress_response?(request, json_content)
compressed = CompressionHelper.compress(json_content)
headers 'Content-Encoding' => 'gzip',
'Content-Length' => compressed.bytesize.to_s,
'Content-Type' => 'application/json'
compressed
else
content_type :json
json_content
end
end
Database Column Compression
Compress large text columns before storage:
require 'zlib'
require 'active_record'
class Article < ActiveRecord::Base
before_save :compress_content
after_find :decompress_content
attr_accessor :content_uncompressed
def content
@content_uncompressed || decompress_stored_content
end
def content=(value)
@content_uncompressed = value
end
private
def compress_content
return unless @content_uncompressed
if @content_uncompressed.bytesize > 1000
self.content_compressed = true
self.content_data = Zlib::Deflate.deflate(@content_uncompressed)
else
self.content_compressed = false
self.content_data = @content_uncompressed
end
end
def decompress_content
return unless content_data
@content_uncompressed = decompress_stored_content
end
def decompress_stored_content
if content_compressed
Zlib::Inflate.inflate(content_data)
else
content_data
end
end
end
Caching Compressed Content
Cache compressed responses for repeated requests:
require 'zlib'
require 'digest'
class CompressedCache
def initialize
@cache = {}
end
def fetch(key, &block)
cache_key = cache_key_for(key)
if @cache.key?(cache_key)
decompress(@cache[cache_key])
else
content = block.call
@cache[cache_key] = compress(content)
content
end
end
private
def cache_key_for(key)
Digest::SHA256.hexdigest(key.to_s)
end
def compress(data)
Zlib::Deflate.deflate(data, Zlib::BEST_SPEED)
end
def decompress(data)
Zlib::Inflate.inflate(data)
end
end
cache = CompressedCache.new
response = cache.fetch('expensive_query') do
# Execute expensive operation
generate_large_report
end
Common Patterns
Several patterns establish standard practices for compression integration.
Transparent Compression
Implement compression transparently within infrastructure layers:
require 'zlib'
require 'json'
class TransparentStorage
def initialize(backend)
@backend = backend
end
def write(key, data)
compressed = Zlib::Deflate.deflate(data)
@backend.write(key, compressed)
@backend.write("#{key}.meta", { compressed: true }.to_json)
end
def read(key)
meta = JSON.parse(@backend.read("#{key}.meta"))
data = @backend.read(key)
if meta['compressed']
Zlib::Inflate.inflate(data)
else
data
end
end
end
storage = TransparentStorage.new(file_backend)
storage.write('document', large_document)
document = storage.read('document')
Conditional Compression by Content Type
Apply compression selectively based on content characteristics:
require 'zlib'
class ContentTypeCompressor
COMPRESSIBLE_TYPES = %w[
text/plain
text/html
text/css
text/javascript
application/json
application/xml
].freeze
INCOMPRESSIBLE_EXTENSIONS = %w[
.jpg .jpeg .png .gif .mp3 .mp4 .zip .gz
].freeze
def compress?(content_type, filename)
return false if already_compressed?(filename)
compressible_type?(content_type)
end
private
def already_compressed?(filename)
INCOMPRESSIBLE_EXTENSIONS.any? { |ext| filename.end_with?(ext) }
end
def compressible_type?(content_type)
COMPRESSIBLE_TYPES.any? { |type| content_type.start_with?(type) }
end
end
compressor = ContentTypeCompressor.new
if compressor.compress?('text/html', 'page.html')
compressed = Zlib::Deflate.deflate(content)
end
Progressive Compression
Compress data progressively as it streams:
require 'zlib'
class ProgressiveCompressor
def initialize(output)
@deflater = Zlib::Deflate.new
@output = output
end
def write(chunk)
compressed = @deflater.deflate(chunk)
@output.write(compressed) unless compressed.empty?
end
def finish
final = @deflater.finish
@output.write(final)
@deflater.close
end
end
File.open('output.deflate', 'wb') do |output|
compressor = ProgressiveCompressor.new(output)
File.open('large_input.txt', 'rb') do |input|
while chunk = input.read(8192)
compressor.write(chunk)
end
end
compressor.finish
end
Compression with Error Recovery
Handle compression errors gracefully:
require 'zlib'
class SafeCompressor
class CompressionError < StandardError; end
def compress(data, fallback: :original)
compressed = Zlib::Deflate.deflate(data)
# Verify compression worked
decompressed = Zlib::Inflate.inflate(compressed)
raise CompressionError unless decompressed == data
compressed
rescue Zlib::Error, CompressionError => e
handle_error(e, data, fallback)
end
private
def handle_error(error, data, fallback)
case fallback
when :original
data
when :raise
raise error
when Proc
fallback.call(data, error)
end
end
end
compressor = SafeCompressor.new
result = compressor.compress(data, fallback: :original)
Reference
Ruby Compression Methods
| Class/Module | Method | Description |
|---|---|---|
| Zlib::Deflate | deflate(string, level) | Compress string with specified level |
| Zlib::Inflate | inflate(string) | Decompress deflated string |
| Zlib::GzipWriter | new(io, level) | Create gzip writer for output stream |
| Zlib::GzipWriter | write(data) | Write data to compressed stream |
| Zlib::GzipReader | new(io) | Create gzip reader for input stream |
| Zlib::GzipReader | read(length) | Read decompressed data from stream |
| Zlib::GzipWriter | open(filename, level) | Open gzip file for writing |
| Zlib::GzipReader | open(filename) | Open gzip file for reading |
Compression Levels
| Level | Constant | Speed | Ratio | Use Case |
|---|---|---|---|---|
| 0 | NO_COMPRESSION | Fastest | None | No compression needed |
| 1 | BEST_SPEED | Very fast | Low | Real-time compression |
| 2-5 | - | Fast | Moderate | Network transmission |
| 6 | DEFAULT_COMPRESSION (-1) | Balanced | Good | General purpose |
| 7-8 | - | Slow | Better | Storage optimization |
| 9 | BEST_COMPRESSION | Slowest | Best | Archival storage |
Algorithm Comparison
| Algorithm | Compression Ratio | Compression Speed | Decompression Speed | Memory Usage | Use Case |
|---|---|---|---|---|---|
| DEFLATE/gzip | Good | Moderate | Fast | Low | General purpose |
| LZ4 | Moderate | Very fast | Very fast | Very low | Real-time processing |
| Brotli | Excellent | Slow | Fast | High | Static web content |
| Zstandard | Excellent | Fast | Fast | Moderate | Modern applications |
| LZMA/XZ | Excellent | Very slow | Moderate | Very high | Archival compression |
Compression Decision Matrix
| Data Size | Compression Frequency | Access Pattern | Recommended Approach |
|---|---|---|---|
| < 100 bytes | Any | Any | Skip compression |
| 100 bytes - 10 KB | Frequent | Sequential | Level 1 DEFLATE |
| 10 KB - 1 MB | Frequent | Sequential | Level 3-6 DEFLATE |
| 10 KB - 1 MB | Infrequent | Random | Block compression |
| > 1 MB | Frequent | Sequential | Level 6 DEFLATE |
| > 1 MB | Infrequent | Sequential | Level 9 DEFLATE |
| > 1 MB | Any | Random | Block compression |
Format Extensions
| Extension | Format | Algorithm | Header | Checksum |
|---|---|---|---|---|
| .gz | gzip | DEFLATE | Yes | CRC32 |
| .zip | ZIP | DEFLATE | Yes | CRC32 |
| .zst | Zstandard | Zstandard | Yes | XXH64 |
| .lz4 | LZ4 | LZ4 | Yes | XXH32 |
| .br | Brotli | Brotli | Minimal | None |
| .xz | XZ | LZMA2 | Yes | CRC64 |
Performance Benchmarks
Typical throughput on modern hardware (MB/s):
| Algorithm | Level | Compression | Decompression |
|---|---|---|---|
| DEFLATE | 1 | 200-300 | 300-500 |
| DEFLATE | 6 | 50-100 | 300-500 |
| DEFLATE | 9 | 20-30 | 300-500 |
| LZ4 | 1 | 400-600 | 2000-3000 |
| Brotli | 6 | 10-20 | 300-400 |
| Zstandard | 3 | 300-400 | 600-800 |
| Zstandard | 19 | 5-10 | 600-800 |