Tar Archives

Overview

Ruby handles tar archives primarily through the archive-tar-minitar gem (now published as minitar), which provides a pure-Ruby implementation for reading and writing POSIX tar archive files. The library offers both high-level convenience methods and low-level streaming interfaces, so archives can be processed without being loaded entirely into memory.

The core classes include Archive::Tar::Minitar for high-level operations, Archive::Tar::Minitar::Reader for streaming reads, and Archive::Tar::Minitar::Writer for streaming writes. These classes handle standard tar format compliance, including file metadata preservation, directory structures, and symbolic links.

require 'archive/tar/minitar'
require 'zlib'

# Create a compressed tar archive
File.open('archive.tar.gz', 'wb') do |file|
  Zlib::GzipWriter.wrap(file) do |gzip|
    Archive::Tar::Minitar.pack(['file1.txt', 'dir/'], gzip)
  end
end

The library integrates with Ruby's compression libraries like zlib for gzip compression and bzip2-ffi for bzip2 compression. File permissions, timestamps, and ownership information are preserved during archive operations when the underlying filesystem supports these attributes.

# Extract with preserved metadata
File.open('archive.tar.gz', 'rb') do |file|
  Zlib::GzipReader.wrap(file) do |gzip|
    Archive::Tar::Minitar.unpack(gzip, 'extracted/')
  end
end

The minitar implementation handles both GNU tar and POSIX tar formats, with automatic format detection during reading. Archive entries can be files, directories, symbolic links, or special device files, each maintaining its original attributes and content.
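As a quick illustration, the Reader class can enumerate entries without extracting anything; the archive name below is a placeholder:

require 'archive/tar/minitar'

# List entry names and sizes without extracting
File.open('backup.tar', 'rb') do |file|
  Archive::Tar::Minitar::Reader.open(file) do |reader|
    reader.each do |entry|
      puts format('%-40s %10d bytes', entry.name, entry.size)
    end
  end
end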

Basic Usage

Creating tar archives requires specifying source files or directories as an array of paths. The Archive::Tar::Minitar.pack method recursively includes directory contents and maintains the relative path structure within the archive.

require 'archive/tar/minitar'

# Create uncompressed tar archive
File.open('backup.tar', 'wb') do |file|
  Archive::Tar::Minitar.pack(['app/', 'config.yml', 'README.md'], file)
end

The pack method accepts files and directories, including directory contents recursively. Glob patterns are not expanded by pack itself, so expand them with Dir.glob before packing, as the next example shows. Symbolic-link handling varies between library versions; verify the behavior if your tree contains links.

# Create archive with multiple source types
sources = [
  'src/',           # Directory (recursive)
  'main.rb',        # Single file
  'docs/*.md'       # Glob pattern
]

File.open('project.tar', 'wb') do |tar_file|
  expanded_sources = sources.flat_map { |src| Dir.glob(src) }.uniq
  Archive::Tar::Minitar.pack(expanded_sources, tar_file)
end

Extracting archives uses the Archive::Tar::Minitar.unpack method, which reads the tar stream and recreates the file structure in the specified destination directory. The method preserves file permissions and timestamps when possible.

# Extract tar archive to specific directory
File.open('backup.tar', 'rb') do |file|
  Archive::Tar::Minitar.unpack(file, 'restore/')
end

Combining with compression requires wrapping the file stream with appropriate compression classes. The pattern remains consistent across different compression formats.

# Create gzip-compressed tar
File.open('archive.tar.gz', 'wb') do |file|
  Zlib::GzipWriter.wrap(file) do |gzip|
    Archive::Tar::Minitar.pack(['data/'], gzip)
  end
end

# Extract gzip-compressed tar
File.open('archive.tar.gz', 'rb') do |file|
  Zlib::GzipReader.wrap(file) do |gzip|
    Archive::Tar::Minitar.unpack(gzip, 'extracted/')
  end
end

The library handles path normalization automatically, converting backslashes to forward slashes and removing redundant path separators. Absolute paths are converted to relative paths to prevent extraction outside the intended directory.
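As an illustration of that normalization (a minimal sketch, not minitar's internal code), the following strips backslashes, empty segments, and parent references from an entry name:

def normalize_entry_name(name)
  # Convert backslashes, drop empty and '.' segments, refuse '..'
  parts = name.tr('\\', '/').split('/').reject { |part| part.empty? || part == '.' }
  parts.reject { |part| part == '..' }.join('/')
end

normalize_entry_name('/var/log/app.log')  # => "var/log/app.log"
normalize_entry_name('../secret.txt')     # => "secret.txt"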

Advanced Usage

Streaming operations provide memory-efficient processing for large archives through the Reader and Writer classes. These classes process archive entries individually without loading the entire archive into memory.

require 'archive/tar/minitar'
require 'fileutils'

# Stream-based archive creation
File.open('streaming.tar', 'wb') do |file|
  Archive::Tar::Minitar::Writer.open(file) do |writer|
    # Add files individually with custom metadata
    Dir.glob('source/**/*').each do |path|
      next if File.directory?(path)
      
      stat = File.stat(path)
      writer.add_file_simple(path, mode: stat.mode, size: stat.size) do |entry|
        File.open(path, 'rb') { |src| entry.write(src.read) }
      end
    end
    
    # Add in-memory content
    content = "Generated at #{Time.now}"
    writer.add_file_simple('timestamp.txt', mode: 0644, size: content.bytesize) do |entry|
      entry.write(content)
    end
  end
end

Custom filtering during extraction allows selective restoration and path manipulation. The reader provides access to individual entry metadata before extraction decisions.

# Selective extraction with custom filtering
File.open('archive.tar', 'rb') do |file|
  Archive::Tar::Minitar::Reader.open(file) do |reader|
    reader.each do |entry|
      # Skip hidden files and certain extensions
      next if entry.name.start_with?('.')
      next if entry.name.end_with?('.tmp', '.log')
      
      # Modify extraction path
      extract_path = entry.name.gsub(/^old_prefix\//, 'new_prefix/')
      
      if entry.directory?
        FileUtils.mkdir_p(extract_path)
      else
        FileUtils.mkdir_p(File.dirname(extract_path))
        File.open(extract_path, 'wb') do |output|
          output.write(entry.read)
        end
        File.chmod(entry.mode, extract_path) if entry.mode
      end
    end
  end
end

Archive inspection and metadata extraction enable analysis without full extraction. Each entry exposes comprehensive information about the archived content.

# Archive analysis and content inspection
def analyze_tar(tar_path)
  entries = []
  total_size = 0
  
  File.open(tar_path, 'rb') do |file|
    Archive::Tar::Minitar::Reader.open(file) do |reader|
      reader.each do |entry|
        entries << {
          name: entry.name,
          size: entry.size,
          mode: entry.mode,
          mtime: entry.mtime,
          type: case entry.typeflag
                 when '0', "\0" then :file
                 when '5' then :directory
                 when '2' then :symlink
                 else :other
                 end
        }
        total_size += entry.size
      end
    end
  end
  
  { entries: entries, total_size: total_size, count: entries.length }
end

Multi-volume archive handling requires coordinating multiple tar files for archives exceeding size limits. The approach involves splitting at file boundaries to maintain archive integrity.

# Create multi-volume archives with size limits
class MultiVolumeCreator
  def initialize(base_name, volume_size_mb = 100)
    @base_name = base_name
    @volume_size = volume_size_mb * 1024 * 1024
    @current_volume = 1
    @current_size = 0
    @writer = nil
    @io = nil
  end
  
  def add_file(path)
    file_size = File.size(path)
    
    # Start a new volume if this file would push past the limit
    if @writer && (@current_size + file_size > @volume_size)
      close
      @writer = nil
      @current_volume += 1
      @current_size = 0
    end
    
    # Open a new volume if needed; keep the File handle so it can be
    # closed explicitly when the volume is finished
    unless @writer
      volume_path = "#{@base_name}.vol#{@current_volume}.tar"
      @io = File.open(volume_path, 'wb')
      @writer = Archive::Tar::Minitar::Writer.open(@io)
    end
    
    @writer.add_file_simple(path, mode: File.stat(path).mode, size: file_size) do |entry|
      File.open(path, 'rb') { |src| entry.write(src.read) }
    end
    
    @current_size += file_size
  end
  
  def close
    @writer&.close
    @io.close if @io && !@io.closed?
  end
end
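A hypothetical run of the class above, splitting a directory tree into 50 MB volumes (paths are placeholders):

creator = MultiVolumeCreator.new('backup', 50)
Dir.glob('data/**/*').each do |path|
  creator.add_file(path) if File.file?(path)
end
creator.close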

Error Handling & Debugging

Tar operations encounter various error conditions including missing files, permission issues, corrupted archives, and filesystem limitations. Proper error handling requires catching specific exceptions and providing meaningful recovery paths.

require 'archive/tar/minitar'
require 'fileutils'

def safe_create_archive(sources, output_path)
  success = false
  File.open(output_path, 'wb') do |file|
    Archive::Tar::Minitar.pack(sources, file)
  end
  success = true
rescue Errno::ENOENT => e
  # Handle missing source files; fall back to the raw message if the
  # path cannot be parsed out of it
  missing_file = e.message[/ - (.+)\z/, 1] || e.message
  raise "Source file not found: #{missing_file}"
rescue Errno::EACCES => e
  # Handle permission errors
  raise "Permission denied: #{e.message}"
rescue SystemCallError => e
  # Handle other filesystem errors
  raise "Filesystem error: #{e.message}"
rescue StandardError => e
  raise "Archive creation failed: #{e.message}"
ensure
  # Clean up the partial archive whenever creation did not complete
  File.unlink(output_path) if !success && File.exist?(output_path)
end

Archive validation before extraction prevents security issues and corrupted data problems. Validation includes format verification, path traversal detection, and size limit enforcement.

def validate_and_extract(archive_path, extract_to, max_size: 100_000_000)
  total_size = 0
  entries = []
  dest_root = File.expand_path(extract_to)
  dest_existed = File.exist?(extract_to)
  
  # First pass: validate archive structure
  File.open(archive_path, 'rb') do |file|
    Archive::Tar::Minitar::Reader.open(file) do |reader|
      reader.each do |entry|
        # Check for directory traversal attacks; compare against the root
        # with a trailing separator so "#{extract_to}-evil" cannot slip past
        normalized = File.expand_path(entry.name, dest_root)
        unless normalized == dest_root || normalized.start_with?(dest_root + File::SEPARATOR)
          raise "Security violation: path traversal detected in #{entry.name}"
        end
        
        # Check size limits
        total_size += entry.size
        if total_size > max_size
          raise "Archive too large: exceeds #{max_size} bytes"
        end
        
        entries << entry.name
      end
    end
  end
  
  # Second pass: extract validated archive
  File.open(archive_path, 'rb') do |file|
    Archive::Tar::Minitar.unpack(file, extract_to)
  end
  
  entries
rescue Archive::Tar::Minitar::InvalidTarStream => e
  raise "Corrupted archive: #{e.message}"
rescue StandardError => e
  # Clean up a partial extraction, but only if this call created the directory
  FileUtils.rm_rf(extract_to) if !dest_existed && File.exist?(extract_to)
  raise "Extraction failed: #{e.message}"
end
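A usage sketch for the validator above; the archive path and size cap are placeholders:

begin
  names = validate_and_extract('upload.tar', 'incoming/', max_size: 50_000_000)
  puts "Validated and extracted #{names.length} entries"
rescue RuntimeError => e
  warn e.message
end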

Streaming error recovery allows processing to continue when individual entries fail, collecting errors for later analysis while preserving successful operations.

class RobustExtractor
  def initialize(error_handler: :collect)
    @errors = []
    @extracted_files = []
    @error_handler = error_handler
  end
  
  def extract_with_recovery(archive_path, extract_to)
    File.open(archive_path, 'rb') do |file|
      Archive::Tar::Minitar::Reader.open(file) do |reader|
        reader.each do |entry|
          begin
            extract_entry(entry, extract_to)
            @extracted_files << entry.name
          rescue StandardError => e
            error = { file: entry.name, error: e.message }
            @errors << error
            
            case @error_handler
            when :raise_first
              raise "Failed to extract #{entry.name}: #{e.message}"
            when :warn
              warn "Warning: Failed to extract #{entry.name}: #{e.message}"
            when :collect
              # Continue processing, collect errors
              next
            end
          end
        end
      end
    end
    
    { extracted: @extracted_files, errors: @errors }
  end
  
  private
  
  def extract_entry(entry, base_path)
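    # Note: validate entry names for path traversal before extraction
    # (see the validation example earlier in this section)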
    full_path = File.join(base_path, entry.name)
    
    if entry.directory?
      FileUtils.mkdir_p(full_path)
    else
      FileUtils.mkdir_p(File.dirname(full_path))
      File.open(full_path, 'wb') do |output|
        output.write(entry.read)
      end
      File.chmod(entry.mode, full_path) if entry.mode
    end
  end
end
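Example usage, collecting failures instead of aborting on the first error (the paths are placeholders):

extractor = RobustExtractor.new(error_handler: :collect)
result = extractor.extract_with_recovery('archive.tar', 'output/')
puts "Extracted #{result[:extracted].length} files"
result[:errors].each { |err| warn "#{err[:file]}: #{err[:error]}" }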

Performance & Memory

Memory usage stays bounded when archives are processed as streams. The convenience methods already copy file contents in chunks rather than reading whole files, but the streaming classes add explicit control over chunk size and per-entry processing, which matters for very large archives.

require 'benchmark'
require 'archive/tar/minitar'

# Memory-efficient streaming vs. convenience method comparison
def benchmark_approaches(large_files)
  puts "Creating archive with #{large_files.length} files"
  
  # Convenience method: pack handles traversal and copying internally
  convenience = Benchmark.measure do
    File.open('convenience.tar', 'wb') do |file|
      Archive::Tar::Minitar.pack(large_files, file)
    end
  end
  
  # Explicit streaming with fixed-size chunks
  streaming = Benchmark.measure do
    File.open('streaming.tar', 'wb') do |file|
      Archive::Tar::Minitar::Writer.open(file) do |writer|
        large_files.each do |path|
          next unless File.file?(path)
          
          File.open(path, 'rb') do |input|
            writer.add_file_simple(path, mode: File.stat(path).mode, size: File.size(path)) do |entry|
              while (chunk = input.read(8192))
                entry.write(chunk)
              end
            end
          end
        end
      end
    end
  end
  
  puts "Memory intensive: #{memory_intensive}"
  puts "Memory efficient: #{memory_efficient}"
end

Compression level optimization balances file size reduction with processing time. Different compression algorithms provide varying trade-offs between compression ratio and speed.

# Compare compression methods and levels
def compression_benchmark(source_files)
  results = {}
  
  # Uncompressed baseline
  results[:uncompressed] = benchmark_compression(source_files, 'baseline.tar') do |file|
    Archive::Tar::Minitar.pack(source_files, file)
  end
  
  # Gzip compression levels
  (1..9).each do |level|
    results["gzip_#{level}".to_sym] = benchmark_compression(source_files, "gzip_#{level}.tar.gz") do |file|
      Zlib::GzipWriter.wrap(file, level) do |gzip|
        Archive::Tar::Minitar.pack(source_files, gzip)
      end
    end
  end
  
  # Bzip2 compression (requires the bzip2-ffi gem)
  begin
    require 'bzip2/ffi'
  rescue LoadError
    nil # bzip2-ffi not installed; skip this benchmark
  end

  if defined?(Bzip2::FFI)
    results[:bzip2] = benchmark_compression(source_files, 'bzip2.tar.bz2') do |file|
      Bzip2::FFI::Writer.open(file) do |bzip2|
        Archive::Tar::Minitar.pack(source_files, bzip2)
      end
    end
  end
  
  results
end

def benchmark_compression(source_files, output_path)
  start_time = Time.now
  
  File.open(output_path, 'wb') do |file|
    yield file
  end
  
  end_time = Time.now
  file_size = File.size(output_path)
  
  {
    time: end_time - start_time,
    size: file_size,
    path: output_path
  }
ensure
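  # Remove the test archive; its size has already been recorded above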
  File.unlink(output_path) if File.exist?(output_path)
end
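A small reporting sketch for the benchmark above; the source glob is a placeholder:

results = compression_benchmark(Dir.glob('data/**/*').select { |p| File.file?(p) })
results.each do |method, stats|
  puts format('%-14s %10d bytes  %.2fs', method, stats[:size], stats[:time])
end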

Parallel processing for multiple archives can improve throughput when creating or extracting multiple independent archives simultaneously.

require 'parallel'

# Parallel archive creation
def create_archives_parallel(source_groups, max_threads: 4)
  Parallel.each(source_groups, in_threads: max_threads) do |name, sources|
    output_path = "#{name}.tar.gz"
    
    File.open(output_path, 'wb') do |file|
      Zlib::GzipWriter.wrap(file) do |gzip|
        Archive::Tar::Minitar.pack(sources, gzip)
      end
    end
    
    puts "Created #{output_path} (#{File.size(output_path)} bytes)"
  end
end

# Usage
archive_groups = {
  'logs' => Dir.glob('logs/**/*.log'),
  'docs' => Dir.glob('docs/**/*.{md,txt}'),
  'configs' => Dir.glob('config/**/*.{yml,json}')
}

create_archives_parallel(archive_groups)

Common Pitfalls

Path handling inconsistencies between operating systems cause archive portability issues. Windows path separators, case sensitivity differences, and path length limitations affect cross-platform archive compatibility.

# Problematic path handling
def problematic_archive_creation
  # Platform-specific absolute paths end up verbatim in the entry names
  sources = Dir.glob('C:\\Users\\*\\Documents\\*.txt')  # Windows-specific
  File.open('archive.tar', 'wb') do |file|
    Archive::Tar::Minitar.pack(sources, file)  # Creates a non-portable archive
  end
end

# Correct cross-platform path normalization
require 'pathname'

def portable_archive_creation(base_dir, patterns)
  sources = patterns.flat_map { |pattern| Dir.glob(File.join(base_dir, pattern)) }

  # Store entry names relative to base_dir, with forward slashes
  normalized_sources = sources.map do |path|
    relative = Pathname.new(path).relative_path_from(Pathname.new(base_dir))
    relative.to_s.tr('\\', '/')
  end

  File.open('portable.tar', 'wb') do |file|
    # Pack from inside base_dir so the relative names resolve correctly
    Dir.chdir(base_dir) do
      Archive::Tar::Minitar.pack(normalized_sources, file)
    end
  end
end

Permission preservation failures occur when extracting archives across different filesystems or when running with insufficient privileges. The extraction process silently ignores permission errors in many cases.

# Detect and handle permission preservation issues
def extract_with_permission_tracking(archive_path, extract_to)
  permission_failures = []
  
  File.open(archive_path, 'rb') do |file|
    Archive::Tar::Minitar::Reader.open(file) do |reader|
      reader.each do |entry|
        extract_path = File.join(extract_to, entry.name)
        
        if entry.directory?
          FileUtils.mkdir_p(extract_path)
        else
          FileUtils.mkdir_p(File.dirname(extract_path))
          File.open(extract_path, 'wb') { |f| f.write(entry.read) }
        end
        
        # Attempt permission restoration with error tracking
        if entry.mode
          begin
            File.chmod(entry.mode, extract_path)
          rescue Errno::EPERM, Errno::ENOTSUP => e
            permission_failures << {
              path: extract_path,
              intended_mode: entry.mode.to_s(8),
              error: e.message
            }
          end
        end
      end
    end
  end
  
  unless permission_failures.empty?
    warn "Permission restoration failed for #{permission_failures.length} files"
    permission_failures.each do |failure|
      warn "  #{failure[:path]}: #{failure[:error]}"
    end
  end
  
  permission_failures
end

Archive corruption from incomplete writes happens when archive creation is interrupted or when insufficient disk space prevents complete file writing. These conditions require careful error handling and validation.

# Robust archive creation with corruption prevention
require 'tmpdir'
require 'fileutils'

def create_archive_safely(sources, output_path, temp_dir: Dir.tmpdir)
  temp_path = File.join(temp_dir, "#{File.basename(output_path)}.tmp")

  # Create archive in a temporary location first
  File.open(temp_path, 'wb') do |file|
    Archive::Tar::Minitar.pack(sources, file)
    file.fsync  # Force write to disk before renaming
  end

  # Verify archive integrity before moving to the final location
  verify_archive_integrity(temp_path)

  # Move into place; the rename is atomic only when temp_dir and the
  # destination are on the same filesystem
  FileUtils.mv(temp_path, output_path)
rescue StandardError => e
  # Clean up the temporary file on any error
  File.unlink(temp_path) if File.exist?(temp_path)
  raise "Archive creation failed: #{e.message}"
end

def verify_archive_integrity(archive_path)
  entry_count = 0
  
  File.open(archive_path, 'rb') do |file|
    Archive::Tar::Minitar::Reader.open(file) do |reader|
      reader.each do |entry|
        entry_count += 1
        # Attempt to read entry content to verify it's accessible
        entry.read if entry.file?
      end
    end
  end
  
  raise "Archive appears empty or corrupted" if entry_count == 0
  entry_count
end
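A usage sketch for the safe-creation helper above (sources and output name are placeholders):

create_archive_safely(['app/', 'config.yml'], 'backup.tar')
puts "backup.tar verified with #{verify_archive_integrity('backup.tar')} entries"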

Large file handling requires special consideration for files exceeding memory capacity or tar format limitations. The traditional tar format has size limits that affect very large files.

# Handle large files and size limitations
def handle_large_files(file_paths, output_path)
  large_files = []
  total_size = 0
  
  file_paths.each do |path|
    next unless File.file?(path)
    
    size = File.size(path)
    total_size += size
    
    # ustar size field holds 11 octal digits, so the per-file limit is
    # 0o77777777777 bytes (8 GiB - 1)
    if size > 8 * 1024 * 1024 * 1024 - 1
      large_files << { path: path, size: size }
    end
  end
  
  unless large_files.empty?
    warn "Warning: Files exceeding traditional tar limits:"
    large_files.each do |file_info|
      warn "  #{file_info[:path]}: #{file_info[:size]} bytes"
    end
    warn "Consider using GNU tar format or splitting large files"
  end
  
  # Proceed with archive creation, streaming large files
  File.open(output_path, 'wb') do |file|
    Archive::Tar::Minitar::Writer.open(file) do |writer|
      file_paths.each do |path|
        next unless File.file?(path)
        
        File.open(path, 'rb') do |input|
          writer.add_file_simple(path, mode: File.stat(path).mode, size: File.size(path)) do |entry|
            while (chunk = input.read(65536))  # 64 KB chunks
              entry.write(chunk)
            end
          end
        end
      end
    end
  end
  
  { total_files: file_paths.length, total_size: total_size, large_files: large_files }
end

Reference

Core Classes and Methods

| Class | Purpose | Key Methods |
| --- | --- | --- |
| Archive::Tar::Minitar | High-level archive operations | pack, unpack |
| Archive::Tar::Minitar::Reader | Streaming archive reading | open, each, rewind |
| Archive::Tar::Minitar::Writer | Streaming archive writing | open, add_file_simple, add_file |
| Archive::Tar::Minitar::PosixHeader | Entry metadata handling | name, mode, size, mtime |

High-Level Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Minitar.pack(sources, dest) | sources (Array), dest (IO) | nil | Creates tar archive from file list |
| Minitar.unpack(src, dest, files = [], opts = {}) | src (IO), dest (String), files (Array), opts (Hash) | nil | Extracts tar archive to directory |

Streaming Reader Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Reader.open(io) | io (IO object) | Reader | Opens tar stream for reading |
| Reader#each | block | self | Iterates over archive entries |
| Reader#rewind | none | self | Resets stream position to beginning |
| Reader#close | none | nil | Closes the tar stream |

Streaming Writer Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Writer.open(io) | io (IO object) | Writer | Opens tar stream for writing |
| Writer#add_file_simple(name, opts) | name (String), opts (Hash: :mode, :size, :mtime) | nil | Adds a file whose size is known up front |
| Writer#add_file(name, opts) | name (String), opts (Hash: :mode, :mtime) | nil | Adds a file of unknown size (requires a seekable stream) |
| Writer#close | none | nil | Finalizes and closes tar stream |

Entry Properties

| Property | Type | Description |
| --- | --- | --- |
| name | String | File path within archive |
| mode | Integer | File permissions (octal) |
| size | Integer | File size in bytes |
| mtime | Time | Modification timestamp |
| typeflag | String | Entry type indicator |
| linkname | String | Link target for symbolic links |
| uid | Integer | User ID of file owner |
| gid | Integer | Group ID of file owner |
| uname | String | Username of file owner |
| gname | String | Group name of file owner |

Type Flag Constants

| Flag | Value | Type |
| --- | --- | --- |
| Normal File | '0' or "\0" | Regular file |
| Hard Link | '1' | Hard link to another file |
| Symbolic Link | '2' | Symbolic link |
| Character Device | '3' | Character special device |
| Block Device | '4' | Block special device |
| Directory | '5' | Directory |
| FIFO | '6' | Named pipe (FIFO) |
| Contiguous File | '7' | Contiguous file (reserved; rarely used) |

Common Options

| Option | Default | Description |
| --- | --- | --- |
| :fsync | false | Force filesystem sync after writes |
| :data_buffer | nil | Custom buffer for data operations |
| :verbose | false | Enable verbose output during operations |

Exception Hierarchy

| Exception | Parent | Description |
| --- | --- | --- |
| Archive::Tar::Minitar::Error | StandardError | Base exception class |
| Archive::Tar::Minitar::InvalidTarStream | Error | Corrupted or invalid tar data |
| Archive::Tar::Minitar::UnexpectedEOF | Error | Premature end of archive |
| Archive::Tar::Minitar::NonSeekableStream | Error | Stream does not support seeking |

Compression Integration

| Library | Compression | Reader Class | Writer Class |
| --- | --- | --- | --- |
| zlib | Gzip | Zlib::GzipReader | Zlib::GzipWriter |
| bzip2-ffi | Bzip2 | Bzip2::FFI::Reader | Bzip2::FFI::Writer |
| ruby-xz | XZ (LZMA2) | XZ::StreamReader | XZ::StreamWriter |