GZip

Overview

Ruby provides GZip compression and decompression through the Zlib library, specifically the Zlib::GzipReader and Zlib::GzipWriter classes. These classes implement the RFC 1952 GZip file format, allowing Ruby applications to create and read compressed data compatible with standard GZip tools.

The GZip implementation operates on IO objects, supporting both file-based and in-memory compression operations. Ruby's GZip classes handle the complete format specification including headers, compression metadata, and checksums automatically.

require 'zlib'

# Basic file compression
Zlib::GzipWriter.wrap(File.open('data.txt.gz', 'wb')) do |gz|
  gz.write("Hello, compressed world!")
end

# Basic file decompression  
Zlib::GzipReader.wrap(File.open('data.txt.gz', 'rb')) do |gz|
  puts gz.read
end
# => "Hello, compressed world!"

The Zlib::GzipWriter class compresses data as it writes, while Zlib::GzipReader decompresses data during reading. Both classes maintain compatibility with external GZip files and provide access to compression metadata including modification times, original filenames, and comments.

Ruby's GZip implementation supports all standard compression levels from 0 (no compression) through 9 (maximum compression), with level 6 as the default. The classes also provide streaming capabilities for processing large datasets without loading entire files into memory.
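
The module-level Zlib.gzip helper (Ruby 2.4 and later) exposes the level as a keyword argument, which makes the size/speed choice easy to demonstrate. A minimal sketch:

require 'zlib'

data = "example record\n" * 1_000

fast  = Zlib.gzip(data, level: Zlib::BEST_SPEED)       # level 1
small = Zlib.gzip(data, level: Zlib::BEST_COMPRESSION) # level 9

puts fast.bytesize >= small.bytesize # => true (the gap narrows on trivially compressible input)
puts Zlib.gunzip(small) == data      # => true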

Basic Usage

File compression with Zlib::GzipWriter requires opening a binary write stream and writing data through the GZip wrapper. The writer automatically handles format headers, compression algorithms, and file checksums.

require 'zlib'

# Compress text file
File.open('large_data.txt.gz', 'wb') do |file|
  Zlib::GzipWriter.wrap(file) do |gz|
    gz.write("Line 1: Important data\n")
    gz.write("Line 2: More information\n") 
    gz.write("Line 3: Final content\n")
  end
end

Reading compressed files uses Zlib::GzipReader with similar IO patterns. The reader transparently decompresses data and provides standard IO methods including read, gets, and each_line.

# Decompress and process file
File.open('large_data.txt.gz', 'rb') do |file|
  Zlib::GzipReader.wrap(file) do |gz|
    gz.each_line do |line|
      puts "Processed: #{line.chomp}"
    end
  end
end
# => Processed: Line 1: Important data
# => Processed: Line 2: More information  
# => Processed: Line 3: Final content

String compression operates through StringIO objects, allowing in-memory compression without temporary files. This approach suits small to medium datasets and network operations.

require 'stringio'

# Compress string to bytes
def compress_string(data)
  io = StringIO.new
  Zlib::GzipWriter.wrap(io) do |gz|
    gz.write(data)
  end
  io.string
end

# Decompress bytes to string
def decompress_string(compressed_data)
  io = StringIO.new(compressed_data)
  Zlib::GzipReader.wrap(io) do |gz|
    gz.read
  end
end

original = "This text will be compressed using GZip"
compressed = compress_string(original)
decompressed = decompress_string(compressed)

puts "Original size: #{original.bytesize} bytes"
puts "Compressed size: #{compressed.bytesize} bytes"  
puts "Decompressed: #{decompressed}"
# => Original size: 39 bytes
# => Compressed size: ~59 bytes (short inputs grow; the GZip header and trailer add 18 bytes of overhead)
# => Decompressed: This text will be compressed using GZip

The GZip format includes metadata storage for original filenames, modification times, and comments. Ruby provides access to this information through reader properties and allows setting metadata during compression.

# Set metadata during compression
File.open('archive.txt.gz', 'wb') do |file|
  Zlib::GzipWriter.wrap(file) do |gz|
    gz.orig_name = "original_file.txt"
    gz.comment = "Archived on #{Time.now}"
    gz.mtime = Time.now
    gz.write("File contents with metadata")
  end
end

# Read metadata from compressed file
File.open('archive.txt.gz', 'rb') do |file|
  Zlib::GzipReader.wrap(file) do |gz|
    puts "Original name: #{gz.orig_name}"
    puts "Comment: #{gz.comment}"
    puts "Modified: #{gz.mtime}"
    puts "Content: #{gz.read}"
  end
end

Error Handling & Debugging

GZip operations raise specific exceptions for different failure conditions. The Zlib::Error hierarchy provides detailed error information for debugging compression issues.

require 'zlib'

def safe_decompress(filename)
  File.open(filename, 'rb') do |file|
    Zlib::GzipReader.wrap(file) do |gz|
      return gz.read
    end
  end
rescue Zlib::GzipFile::Error => e
  puts "GZip format error: #{e.message}"
  nil
rescue Zlib::DataError => e
  puts "Corrupted data: #{e.message}"
  nil
rescue Zlib::BufError => e
  puts "Buffer error: #{e.message}" 
  nil
rescue Errno::ENOENT
  puts "File not found: #{filename}"
  nil
end

# Test with corrupted file
File.write('broken.gz', 'not gzip data')
result = safe_decompress('broken.gz')
# => GZip format error: not in gzip format

Validating GZip files before processing prevents application crashes and provides user feedback for invalid data. Ruby's GZip implementation performs header validation automatically but requires explicit error handling.

def validate_gzip_file(filename)
  File.open(filename, 'rb') do |file|
    # Check magic number (first 2 bytes)
    magic = file.read(2)
    return false unless magic == "\x1f\x8b".b # compare in binary; a plain literal is UTF-8 and never equals ASCII-8BIT bytes
    
    file.rewind
    Zlib::GzipReader.wrap(file) do |gz|
      # Attempt to read first byte to validate format
      gz.readchar
      return true
    end
  end
rescue Zlib::Error, EOFError
  false
rescue Errno::ENOENT
  false
end

# Create test files
File.binwrite('valid.gz', Zlib.gzip('test'))               # real GZip data
File.binwrite('invalid.gz', Zlib::Deflate.deflate('test')) # raw deflate stream lacks the GZip wrapper

puts validate_gzip_file('valid.gz')   # => true
puts validate_gzip_file('invalid.gz') # => false (deflate != gzip)

Maliciously crafted archives can expand to many times their compressed size during decompression (a "gzip bomb"), so decompressing untrusted input calls for defensive programming. Size limits and chunked reads prevent denial-of-service through memory exhaustion.

def safe_decompress_with_limit(filename, max_size: 10 * 1024 * 1024)
  decompressed_size = 0
  result = String.new
  
  File.open(filename, 'rb') do |file|
    Zlib::GzipReader.wrap(file) do |gz|
      while chunk = gz.read(8192) # Read in chunks
        decompressed_size += chunk.bytesize
        
        if decompressed_size > max_size
          raise "Decompressed size exceeds limit: #{max_size} bytes"
        end
        
        result << chunk
      end
    end
  end
  
  result
rescue Zlib::Error => e
  raise "Decompression failed: #{e.message}"
end
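
A usage sketch for the limiter above (the filename and limit are illustrative):

begin
  data = safe_decompress_with_limit('upload.gz', max_size: 1 * 1024 * 1024)
  puts "Decompressed #{data.bytesize} bytes"
rescue RuntimeError => e
  puts "Rejected: #{e.message}"
end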

Performance & Memory

Compression levels significantly impact both processing time and output size. Higher compression levels require more CPU cycles but produce smaller files, creating trade-offs for different use cases.

require 'benchmark'

data = "x" * 100_000 # 100KB of repeated character

Benchmark.bm(15) do |x|
  (0..9).each do |level|
    x.report("Level #{level}:") do
      io = StringIO.new
      Zlib::GzipWriter.wrap(io, level) do |gz|
        gz.write(data)
      end
      
      compressed_size = io.string.bytesize
      ratio = (compressed_size.to_f / data.bytesize * 100).round(1)
      print " #{compressed_size} bytes (#{ratio}%)"
    end
  end
end

Streaming large files prevents memory exhaustion by processing data in chunks. Ruby's GZip classes support streaming operations that maintain constant memory usage regardless of file size.

def stream_compress_file(input_path, output_path, chunk_size: 64 * 1024)
  File.open(output_path, 'wb') do |output_file|
    Zlib::GzipWriter.wrap(output_file) do |gz|
      File.open(input_path, 'rb') do |input_file|
        while chunk = input_file.read(chunk_size)
          gz.write(chunk)
        end
      end
    end
  end
end

def stream_decompress_file(input_path, output_path, chunk_size: 64 * 1024)
  File.open(input_path, 'rb') do |input_file|
    Zlib::GzipReader.wrap(input_file) do |gz|
      File.open(output_path, 'wb') do |output_file|
        while chunk = gz.read(chunk_size)
          output_file.write(chunk)
        end
      end
    end
  end
end
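
Usage is symmetric (the paths here are illustrative):

stream_compress_file('access.log', 'access.log.gz')
stream_decompress_file('access.log.gz', 'access_restored.log')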

Chunk size selection affects compression throughput and memory usage. zlib manages its own internal buffers, but the size of the chunks your code reads and writes determines Ruby-level string allocation and per-call overhead, so it is worth tuning for the workload.

class OptimizedGzipProcessor
  def initialize(buffer_size: 32 * 1024)
    @buffer_size = buffer_size
  end
  
  def compress_data(data)
    io = StringIO.new
    Zlib::GzipWriter.wrap(io) do |gz|
      # Strings are not Enumerable; slice by byte offset instead of each_slice
      offset = 0
      while offset < data.bytesize
        gz.write(data.byteslice(offset, @buffer_size))
        offset += @buffer_size
      end
    end
    io.string
  end
  
  def process_large_dataset(file_pattern)
    Dir.glob(file_pattern).each do |filename|
      compressed_name = "#{filename}.gz"
      
      start_time = Time.now
      stream_compress_file(filename, compressed_name, chunk_size: @buffer_size)
      processing_time = Time.now - start_time
      
      original_size = File.size(filename)
      compressed_size = File.size(compressed_name)
      ratio = (compressed_size.to_f / original_size * 100).round(1)
      
      puts "#{filename}: #{original_size}#{compressed_size} (#{ratio}%) in #{processing_time.round(2)}s"
    end
  end
end

Production Patterns

Web applications commonly use GZip compression for HTTP response compression. Ruby web frameworks and Rack middleware provide integration points for transparent response compression.

# Rack middleware example
class GzipResponseMiddleware
  def initialize(app, options = {})
    @app = app
    @min_size = options[:min_size] || 1024
    @compress_types = options[:types] || %w[text/html text/css text/javascript application/json]
  end
  
  def call(env)
    status, headers, body = @app.call(env)
    headers = Rack::Utils::HeaderHash.new(headers)
    
    return [status, headers, body] unless should_compress?(env, headers, body)
    
    compressed_body = compress_response(body)
    body.close if body.respond_to?(:close) # Rack spec: close the original body once consumed
    headers['Content-Encoding'] = 'gzip'
    headers['Content-Length'] = compressed_body.bytesize.to_s
    headers.delete('ETag') # Remove ETag as content changed
    
    [status, headers, [compressed_body]]
  end
  
  private
  
  def should_compress?(env, headers, body)
    return false unless env['HTTP_ACCEPT_ENCODING']&.include?('gzip')
    return false unless @compress_types.any? { |type| headers['Content-Type']&.start_with?(type) }
    
    body_size = body.respond_to?(:bytesize) ? body.bytesize : body.sum(&:bytesize)
    body_size >= @min_size
  end
  
  def compress_response(body)
    io = StringIO.new
    Zlib::GzipWriter.wrap(io) do |gz|
      body.each { |chunk| gz.write(chunk) }
    end
    io.string
  end
end
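
Wiring the middleware into a Rack application happens in config.ru; the application constant here is hypothetical:

# config.ru
require_relative 'gzip_response_middleware'

use GzipResponseMiddleware, min_size: 512
run MyApp.new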

Log file compression for long-term storage requires handling active log rotation and maintaining searchability. Background processing with proper error handling prevents impact on application performance.

class LogCompressor
  def initialize(log_directory, retention_days: 30)
    @log_directory = log_directory
    @retention_days = retention_days
  end
  
  def compress_old_logs
    old_log_files.each do |log_file|
      compress_log_file(log_file)
      File.delete(log_file) if File.exist?("#{log_file}.gz")
    end
    
    cleanup_expired_logs
  end
  
  private
  
  def old_log_files
    pattern = File.join(@log_directory, "*.log")
    Dir.glob(pattern).select do |file|
      File.mtime(file) < Time.now - (24 * 60 * 60) # Older than 1 day
    end
  end
  
  def compress_log_file(log_file)
    compressed_file = "#{log_file}.gz"
    return if File.exist?(compressed_file)
    
    File.open(compressed_file, 'wb') do |output|
      Zlib::GzipWriter.wrap(output) do |gz|
        gz.orig_name = File.basename(log_file)
        gz.mtime = File.mtime(log_file)
        
        File.open(log_file, 'rb') do |input|
          IO.copy_stream(input, gz)
        end
      end
    end
  rescue => e
    File.delete(compressed_file) if File.exist?(compressed_file)
    raise "Failed to compress #{log_file}: #{e.message}"
  end
  
  def cleanup_expired_logs
    cutoff_time = Time.now - (@retention_days * 24 * 60 * 60)
    
    Dir.glob(File.join(@log_directory, "*.gz")).each do |gz_file|
      File.delete(gz_file) if File.mtime(gz_file) < cutoff_time
    end
  end
end

Database backup compression reduces storage costs and transfer times. Streaming compression during backup creation eliminates temporary file requirements and improves performance.

class DatabaseBackupCompressor
  def initialize(database_url, backup_location)
    @database_url = database_url
    @backup_location = backup_location
  end
  
  def create_compressed_backup
    timestamp = Time.now.strftime("%Y%m%d_%H%M%S")
    backup_file = File.join(@backup_location, "backup_#{timestamp}.sql.gz")
    
    File.open(backup_file, 'wb') do |file|
      Zlib::GzipWriter.wrap(file) do |gz|
        gz.orig_name = "backup_#{timestamp}.sql"
        gz.mtime = Time.now
        
        # Stream pg_dump output directly to GZip
        IO.popen(["pg_dump", @database_url], "rb") do |dump|
          IO.copy_stream(dump, gz)
        end
      end
    end
    
    verify_backup(backup_file)
    backup_file
  end
  
  def restore_from_backup(backup_file)
    File.open(backup_file, 'rb') do |file|
      Zlib::GzipReader.wrap(file) do |gz|
        IO.popen(["psql", @database_url], "wb") do |psql|
          IO.copy_stream(gz, psql)
        end
      end
    end
  end
  
  private
  
  def verify_backup(backup_file)
    File.open(backup_file, 'rb') do |file|
      Zlib::GzipReader.wrap(file) do |gz|
        # Read first few bytes to verify decompression works
        header = gz.read(100)
        raise "Invalid backup file" unless header&.include?("PostgreSQL")
      end
    end
  end
end

Common Pitfalls

Character encoding issues arise when compressing text data without proper encoding handling. GZip operates on bytes, requiring explicit encoding specification for text data to prevent corruption.

# Incorrect: Encoding ignored during compression
def compress_text_wrong(text)
  io = StringIO.new
  Zlib::GzipWriter.wrap(io) do |gz|
    gz.write(text) # May use wrong encoding
  end
  io.string
end

# Correct: Explicit encoding handling
def compress_text_correct(text, encoding: 'UTF-8')
  io = StringIO.new
  Zlib::GzipWriter.wrap(io) do |gz|
    gz.write(text.encode(encoding))
  end
  io.string
end

# Test with non-ASCII text
text = "Héllo Wörld! 🌍"
compressed = compress_text_correct(text)

# Decompression with encoding restoration
def decompress_text(compressed_data, encoding: 'UTF-8')
  io = StringIO.new(compressed_data)
  Zlib::GzipReader.wrap(io) do |gz|
    gz.read.force_encoding(encoding)
  end
end

decompressed = decompress_text(compressed)
puts decompressed == text # => true

Resource leaks occur when GZip streams remain unclosed, particularly on error paths. The garbage collector only releases abandoned file handles at some later finalization pass, so unclosed streams accumulate and can exhaust file descriptors.

# Problematic: Manual resource management
def risky_compression(files)
  files.each do |filename|
    gz = Zlib::GzipWriter.new(File.open("#{filename}.gz", 'wb'))
    gz.write(File.read(filename))
    # Missing gz.close - resource leak!
  end
end

# Safe: Block-based resource management
def safe_compression(files)
  files.each do |filename|
    File.open("#{filename}.gz", 'wb') do |file|
      Zlib::GzipWriter.wrap(file) do |gz|
        gz.write(File.read(filename))
        # Automatic cleanup via block
      end
    end
  end
end

# Safest: Exception-safe resource cleanup
def robust_compression(files)
  files.each do |filename|
    output_file = nil
    gz = nil
    
    begin
      output_file = File.open("#{filename}.gz", 'wb')
      gz = Zlib::GzipWriter.new(output_file)
      gz.write(File.read(filename))
    ensure
      gz&.close
      output_file&.close
    end
  end
end

Compression level misconceptions lead to inappropriate settings for specific use cases. Maximum compression (level 9) rarely provides significant benefits over moderate levels while consuming substantially more CPU resources.

require 'benchmark'

def demonstrate_compression_tradeoffs(data)
  levels = [1, 6, 9]
  results = {}
  
  levels.each do |level|
    time = Benchmark.realtime do
      io = StringIO.new
      Zlib::GzipWriter.wrap(io, level) do |gz|
        gz.write(data)
      end
      results[level] = {
        size: io.string.bytesize,
        ratio: (io.string.bytesize.to_f / data.bytesize * 100).round(1)
      }
    end
    results[level][:time] = time.round(4)
  end
  
  puts "Compression Analysis for #{data.bytesize} byte input:"
  results.each do |level, stats|
    puts "Level #{level}: #{stats[:size]} bytes (#{stats[:ratio]}%) in #{stats[:time]}s"
  end
  
  # Calculate efficiency (compression per second)
  results.each do |level, stats|
    efficiency = ((100 - stats[:ratio]) / stats[:time]).round(1)
    puts "Level #{level} efficiency: #{efficiency} compression points/second"
  end
end

# Test with different data types
text_data = File.read('/usr/share/dict/words') rescue "word " * 10000
demonstrate_compression_tradeoffs(text_data)

Memory accumulation during streaming operations occurs when developers buffer entire streams instead of processing data in chunks. This pattern defeats the memory benefits of streaming compression.

# Memory-intensive: Accumulates entire stream
def bad_streaming_decompress(filename)
  result = String.new
  
  File.open(filename, 'rb') do |file|
    Zlib::GzipReader.wrap(file) do |gz|
      # Loads entire file into memory
      result = gz.read 
    end
  end
  
  result
end

# Memory-efficient: Processes chunks
def good_streaming_decompress(filename, &block)
  File.open(filename, 'rb') do |file|
    Zlib::GzipReader.wrap(file) do |gz|
      while chunk = gz.read(64 * 1024)
        yield chunk
      end
    end
  end
end

# Example usage with memory monitoring
def process_large_gzip_file(filename)
  total_processed = 0
  
  good_streaming_decompress(filename) do |chunk|
    # Process chunk without storing
    total_processed += chunk.bytesize
    
    # Periodic progress reporting
    if total_processed % (1024 * 1024) == 0
      puts "Processed #{total_processed / 1024 / 1024}MB"
    end
  end
end

Reference

Core Classes

| Class | Purpose | Key Methods |
| --- | --- | --- |
| Zlib::GzipWriter | Compress data to GZip format | #write, #puts, #close, #flush |
| Zlib::GzipReader | Decompress GZip format data | #read, #gets, #each_line, #rewind |
| Zlib::GzipFile | Base class for GZip operations | #close, #closed?, #sync, #sync= |

GzipWriter Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #initialize(io, level=nil, strategy=nil) | IO object, compression level (0-9), strategy | GzipWriter | Creates new writer with optional compression settings |
| #write(string) | String data | Integer | Writes string to compressed stream, returns bytes written |
| #puts(*objects) | Objects to write | nil | Writes objects as lines with newline separators |
| #print(*objects) | Objects to write | nil | Writes objects to stream without separators |
| #printf(format, *objects) | Format string, objects | nil | Writes formatted output to stream |
| #flush(flush=nil) | Flush mode constant | self | Flushes pending data to underlying IO |
| #close | None | nil | Closes writer and finalizes compression |
| #orig_name= | String filename | String | Sets original filename metadata |
| #comment= | String comment | String | Sets comment metadata |
| #mtime= | Time object | Time | Sets modification time metadata |

GzipReader Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| #initialize(io) | IO object | GzipReader | Creates new reader for compressed stream |
| #read(length=nil) | Bytes to read | String/nil | Reads decompressed data, nil at EOF |
| #gets(separator=$/) | Line separator | String/nil | Reads line from decompressed stream |
| #each_line(&block) | Block | Enumerator | Iterates over lines in decompressed data |
| #readlines(separator=$/) | Line separator | Array | Returns all lines as array |
| #rewind | None | 0 | Resets reader to beginning of stream |
| #pos | None | Integer | Returns current position in decompressed data |
| #eof? | None | Boolean | Returns true if at end of file |
| #orig_name | None | String/nil | Returns original filename from metadata |
| #comment | None | String/nil | Returns comment from metadata |
| #mtime | None | Time | Returns modification time from metadata |
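
The navigation methods behave like their IO counterparts but operate on decompressed positions. A short sketch using an in-memory stream:

require 'zlib'
require 'stringio'

io = StringIO.new(Zlib.gzip("alpha\nbeta\n"))

Zlib::GzipReader.wrap(io) do |gz|
  puts gz.gets              # => alpha
  puts gz.pos               # => 6 (bytes of decompressed data consumed)
  gz.rewind                 # seeks the underlying IO back to the start
  puts gz.readlines.inspect # => ["alpha\n", "beta\n"]
  puts gz.eof?              # => true
end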

Compression Levels

| Level | Speed | Ratio | Use Case |
| --- | --- | --- | --- |
| 0 | Fastest | None | Storage without compression |
| 1-3 | Fast | Low | Real-time compression, CPU-limited |
| 4-6 | Balanced | Medium | General purpose, default (6) |
| 7-9 | Slow | High | Archival, bandwidth-limited |

Exception Hierarchy

| Exception | Parent | Description |
| --- | --- | --- |
| Zlib::Error | StandardError | Base class for all Zlib errors |
| Zlib::GzipFile::Error | Zlib::Error | GZip format or operation errors |
| Zlib::GzipFile::NoFooter | Zlib::GzipFile::Error | Missing or invalid file footer |
| Zlib::GzipFile::CRCError | Zlib::GzipFile::Error | Checksum validation failure |
| Zlib::GzipFile::LengthError | Zlib::GzipFile::Error | Length validation failure |
| Zlib::DataError | Zlib::Error | Corrupted or invalid compressed data |
| Zlib::BufError | Zlib::Error | Buffer size or state errors |
| Zlib::StreamError | Zlib::Error | Stream state or operation errors |

Compression and Flush Constants

| Constant | Value | Description |
| --- | --- | --- |
| Zlib::SYNC_FLUSH | 2 | Flush mode for real-time streaming |
| Zlib::FULL_FLUSH | 3 | Complete buffer flush that resets compression state |
| Zlib::BEST_SPEED | 1 | Fastest compression |
| Zlib::BEST_COMPRESSION | 9 | Maximum compression |
| Zlib::DEFAULT_COMPRESSION | -1 | Default compression level (maps to 6) |
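
Zlib::SYNC_FLUSH is what GzipWriter#flush uses by default; calling it pushes all pending compressed bytes to the underlying IO so a consumer can decompress everything written so far without waiting for #close. A minimal in-memory sketch:

require 'zlib'
require 'stringio'

io = StringIO.new
gz = Zlib::GzipWriter.new(io)
gz.write("partial event data")
gz.flush(Zlib::SYNC_FLUSH) # emit buffered bytes without ending the stream

puts io.string.bytesize    # non-zero before gz.close
gz.close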

Convenience Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Zlib::GzipWriter.open(filename, level=nil) | Filename, compression level | GzipWriter | Opens file for writing with automatic close |
| Zlib::GzipReader.open(filename) | Filename | GzipReader | Opens file for reading with automatic close |
| Zlib::GzipWriter.wrap(io, &block) / Zlib::GzipReader.wrap(io, &block) | IO object, block | Block result | Wraps an existing IO, closing it when the block returns |
| Zlib.gzip(string, level: nil, strategy: nil) | String data, keyword options | String | Compresses string to GZip bytes |
| Zlib.gunzip(gzipped_string) | GZip bytes | String | Decompresses GZip bytes to string |
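
The one-shot helpers cover common cases without explicit IO plumbing; note the keyword argument on Zlib.gzip (Ruby 2.4 and later):

require 'zlib'

# File-based, with automatic close
Zlib::GzipWriter.open('report.txt.gz') { |gz| gz.write("quarterly numbers") }
Zlib::GzipReader.open('report.txt.gz') { |gz| puts gz.read }

# In-memory, one call each way
bytes = Zlib.gzip("in-memory payload", level: Zlib::BEST_COMPRESSION)
puts Zlib.gunzip(bytes) # => in-memory payload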