CrackedRuby

File Formats (CSV, JSON, Parquet, Avro)

Overview

File formats define how data gets structured, serialized, and stored for persistence or transmission between systems. CSV, JSON, Parquet, and Avro represent different approaches to data serialization, each optimized for specific use cases and constraints.

CSV (Comma-Separated Values) stores tabular data in plain text with values separated by delimiters. Each line represents a row, with commas (or other characters) separating fields. CSV originated in early database and spreadsheet applications, becoming ubiquitous for data exchange due to its simplicity and human readability.

JSON (JavaScript Object Notation) serializes hierarchical data structures using human-readable text. Based on JavaScript object syntax, JSON represents data as key-value pairs, arrays, and nested objects. JSON became the dominant format for web APIs and configuration files due to its flexibility and language-independent parsing.

Parquet stores columnar data in a compressed, binary format optimized for analytical queries. Developed within the Apache Hadoop ecosystem, Parquet organizes data by columns rather than rows, enabling efficient compression and selective column reading for analytics workloads.

Avro serializes structured data in a compact binary format with schema evolution support. Created by Apache as part of the Hadoop ecosystem, Avro embeds schema information with data, enabling compatibility checks and schema migration across system versions.

These formats differ fundamentally in their data models, serialization approaches, and performance characteristics:

# CSV - simple tabular data
csv_data = <<~CSV
  name,age,city
  Alice,30,Boston
  Bob,25,Seattle
CSV

# JSON - hierarchical structures
json_data = {
  users: [
    { name: "Alice", age: 30, address: { city: "Boston" } },
    { name: "Bob", age: 25, address: { city: "Seattle" } }
  ]
}.to_json

# Parquet - columnar binary (typically handled by specialized libraries)
# Data stored in column chunks with compression and encoding

# Avro - row-based binary with schema
# Schema definition separate from data encoding

Format selection impacts system performance, storage costs, query patterns, and integration complexity. Text-based formats (CSV, JSON) prioritize human readability and interoperability at the cost of storage efficiency. Binary formats (Parquet, Avro) optimize for performance and schema management but require specialized tools for inspection.

Key Principles

Data serialization formats balance competing requirements: human readability versus machine efficiency, flexibility versus type safety, storage size versus query performance. Understanding these trade-offs guides format selection for specific use cases.

Schema Enforcement

CSV lacks formal schema definition. Without converters, Ruby's CSV returns every field as a String (or nil for an empty unquoted field); types must be inferred after the fact, creating ambiguity around numeric strings, dates, and null values:

require 'csv'

data = "id,value\n1,100\n2,abc\n3,\n"
parsed = CSV.parse(data, headers: true)

parsed[0]["value"]  # "100" (String, not Integer)
parsed[2]["value"]  # nil (unquoted empty field parses as nil, not "")
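Because every field arrives as a String (or nil), applications typically coerce types explicitly after parsing. A minimal sketch:

```ruby
require 'csv'

# Coerce CSV string fields into typed Ruby values explicitly,
# treating unparseable and missing values as nil
def coerce_int(value)
  Integer(value, 10)
rescue TypeError, ArgumentError
  nil
end

data = "id,value\n1,100\n2,abc\n3,\n"
rows = CSV.parse(data, headers: true).map do |row|
  { id: coerce_int(row["id"]), value: coerce_int(row["value"]) }
end

rows[0][:value]  # 100 (Integer)
rows[1][:value]  # nil ("abc" is not an integer)
rows[2][:value]  # nil (missing field)
```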

JSON provides type information for basic types (string, number, boolean, null, array, object) but lacks date, decimal, or custom type support. Type validation requires external schema languages like JSON Schema:

require 'json'

data = { timestamp: "2025-01-15", amount: 99.99 }
json_str = data.to_json  # {"timestamp":"2025-01-15","amount":99.99}

parsed = JSON.parse(json_str)
parsed["timestamp"]  # String - date information lost
parsed["amount"]     # Float - binary floating point, not an exact decimal
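Validating parsed JSON against a schema requires extra tooling (for example the third-party json_schemer gem). A hand-rolled type check in plain Ruby, purely illustrative of the idea rather than any gem's API:

```ruby
require 'json'

# Minimal JSON Schema-style validation sketch: check each expected
# field against an expected Ruby class (illustrative only)
SCHEMA = { "timestamp" => String, "amount" => Numeric }.freeze

def valid?(document)
  SCHEMA.all? { |key, type| document[key].is_a?(type) }
end

parsed = JSON.parse('{"timestamp":"2025-01-15","amount":99.99}')
valid?(parsed)                                # true
valid?(JSON.parse('{"timestamp":20250115}'))  # false - wrong type, missing field
```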

Parquet embeds schema in file metadata, enforcing column types at write time. The schema defines logical types (date, decimal, timestamp) that map to physical storage types:

# Parquet schema definition (conceptual)
schema = {
  fields: [
    { name: "id", type: "INT64", required: true },
    { name: "amount", type: "DECIMAL(10,2)", required: true },
    { name: "timestamp", type: "TIMESTAMP_MILLIS", required: false }
  ]
}

Avro requires explicit schema definition before serialization. The schema travels with the data or gets stored in a schema registry, enabling version compatibility checks:

# Avro schema definition
schema = {
  type: "record",
  name: "Transaction",
  fields: [
    { name: "id", type: "long" },
    { name: "amount", type: { type: "bytes", logicalType: "decimal", precision: 10, scale: 2 } },
    { name: "timestamp", type: ["null", "long"], default: nil }
  ]
}

Data Model and Nesting

CSV represents flat, tabular data. Each row contains the same fields in the same order. Nested structures require flattening or encoding as strings:

# Nested data must be flattened
address = { street: "123 Main", city: "Boston" }
csv_row = ["Alice", "123 Main", "Boston"]  # Flatten structure

# Or encode as JSON string
csv_row = ["Alice", '{"street":"123 Main","city":"Boston"}']  # Nested as string
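Encoding nested data as a JSON string inside a CSV cell round-trips, at the cost of parsing twice; CSV quoting protects the embedded commas and quotes:

```ruby
require 'csv'
require 'json'

# Embed a nested structure as a JSON string in one CSV column,
# then recover it by parsing twice (CSV first, then JSON)
address = { street: "123 Main", city: "Boston" }
csv_line = CSV.generate_line(["Alice", address.to_json])
# CSV doubles the embedded quotes:
# Alice,"{""street"":""123 Main"",""city"":""Boston""}"

name, address_json = CSV.parse_line(csv_line)
restored = JSON.parse(address_json)
restored["city"]  # "Boston"
```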

JSON natively supports nested objects and arrays. Arbitrary nesting depth enables hierarchical data representation:

data = {
  user: {
    name: "Alice",
    addresses: [
      { type: "home", city: "Boston" },
      { type: "work", city: "Cambridge" }
    ]
  }
}

Parquet supports nested structures through repetition and definition levels. Nested columns get encoded as separate column chunks with metadata tracking nesting:

# Parquet nested schema
schema = {
  fields: [
    { name: "user.name", type: "STRING" },
    { name: "user.addresses", type: "LIST", element: {
      type: "STRUCT", fields: [
        { name: "type", type: "STRING" },
        { name: "city", type: "STRING" }
      ]
    }}
  ]
}

Avro supports complex types (records, arrays, maps, unions) through schema composition. Nested schemas define hierarchical structures with type safety:

# Avro nested schema
schema = {
  type: "record",
  name: "User",
  fields: [
    { name: "name", type: "string" },
    { name: "addresses", type: {
      type: "array",
      items: {
        type: "record",
        name: "Address",
        fields: [
          { name: "type", type: "string" },
          { name: "city", type: "string" }
        ]
      }
    }}
  ]
}

Serialization Approach

CSV serializes row-by-row with delimiters separating values. Special characters require escaping through quoting:

require 'csv'

rows = [
  ["Name", "Description"],
  ["Product A", "Contains, comma"],
  ["Product B", 'Contains "quotes"']
]

csv_string = CSV.generate do |csv|
  rows.each { |row| csv << row }
end

# Output:
# Name,Description
# Product A,"Contains, comma"
# Product B,"Contains ""quotes"""
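Parsing the generated string back confirms that quoting is lossless:

```ruby
require 'csv'

# Round-trip: values containing delimiters and quotes survive
# CSV generation and parsing unchanged
rows = [
  ["Product A", "Contains, comma"],
  ["Product B", 'Contains "quotes"']
]

csv_string = CSV.generate { |csv| rows.each { |row| csv << row } }
parsed = CSV.parse(csv_string)

parsed == rows  # true - escaping round-trips exactly
```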

JSON serializes to text with syntax rules for objects, arrays, strings, and primitives. Serialization transforms Ruby objects to JSON text representation:

require 'json'

data = {
  products: [
    { name: "Product A", price: 10.50 },
    { name: "Product B", price: 20.00 }
  ]
}

json = JSON.generate(data)
# {"products":[{"name":"Product A","price":10.5},{"name":"Product B","price":20.0}]}

Parquet serializes data column-by-column, applying encoding and compression to each column independently. Values for a single column get stored contiguously:

# Conceptual column-wise storage
column_data = {
  "name" => ["Alice", "Bob", "Carol"],         # Stored together
  "age" => [30, 25, 35],                       # Stored together
  "city" => ["Boston", "Seattle", "Boston"]    # Stored together, compressed
}
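Columnar layout pays off because values within one column are self-similar. A plain-Ruby run-length encoding sketch illustrates why a column of repeated values compresses well (real Parquet chooses among dictionary, RLE, and delta encodings per column chunk):

```ruby
# Run-length encode one column's values - illustrative of the kind of
# per-column encoding Parquet applies, not Parquet's actual format
def run_length_encode(values)
  values.chunk_while { |a, b| a == b }.map { |run| [run.first, run.size] }
end

cities = ["Boston", "Boston", "Boston", "Seattle", "Seattle"]
run_length_encode(cities)  # [["Boston", 3], ["Seattle", 2]]
```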

Avro serializes data row-by-row in compact binary format. The schema defines field order and encoding rules:

# Avro binary encoding (conceptual)
# Schema: { name: string, age: int }
# Data: { name: "Alice", age: 30 }
# Binary: [5, "Alice", 30]  # 5 = string length
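Part of this compactness comes from Avro's integer encoding: per the Avro specification, int and long values are zigzag-encoded and then written as variable-length bytes, so small magnitudes take few bytes. The zigzag step in Ruby:

```ruby
# Zigzag encoding maps small-magnitude signed integers to small
# unsigned integers, so the varint stage emits few bytes (Avro spec)
def zigzag_encode(n)
  (n << 1) ^ (n >> 63)
end

zigzag_encode(0)   # 0
zigzag_encode(-1)  # 1
zigzag_encode(1)   # 2
zigzag_encode(-2)  # 3
```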

Schema Evolution

CSV lacks schema evolution mechanisms. Adding fields requires manual coordination and column position assumptions:

# Original CSV
original = "name,age\nAlice,30\n"

# Adding field requires coordination
evolved = "name,age,city\nAlice,30,Boston\n"  # New field added
legacy = "name,age\nBob,25\n"                 # Old format incompatible

JSON handles field additions gracefully through object structure. Parsers ignore unknown fields, and missing fields default to nil:

require 'json'

old_format = { name: "Alice", age: 30 }.to_json
new_format = { name: "Bob", age: 25, city: "Seattle" }.to_json

# Parser handles both
JSON.parse(old_format)  # Missing city field
JSON.parse(new_format)  # Additional city field
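Consumers defend against missing fields with explicit defaults, for example via Hash#fetch:

```ruby
require 'json'

# Apply an explicit default when a field is absent from older payloads
old_format = JSON.parse('{"name":"Alice","age":30}')
new_format = JSON.parse('{"name":"Bob","age":25,"city":"Seattle"}')

old_format.fetch("city", "Unknown")  # "Unknown" - default applied
new_format.fetch("city", "Unknown")  # "Seattle" - real value wins
```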

Parquet supports schema evolution through column projection. Readers can request specific columns, ignoring new or missing fields:

# Write with schema v1: [name, age]
# Write with schema v2: [name, age, city]
# Reader can request [name, age] from both files

Avro provides explicit schema evolution rules. Reader and writer schemas can differ if compatible according to resolution rules:

# Writer schema v1
writer_schema = {
  type: "record",
  name: "User",
  fields: [
    { name: "name", type: "string" },
    { name: "age", type: "int" }
  ]
}

# Reader schema v2 (compatible evolution)
reader_schema = {
  type: "record",
  name: "User",
  fields: [
    { name: "name", type: "string" },
    { name: "age", type: "int" },
    { name: "city", type: "string", default: "Unknown" }  # New field with default
  ]
}

Ruby Implementation

Ruby provides built-in support for CSV and JSON through standard library modules. Binary formats require external gems with native extensions for performance.

CSV Processing

Ruby's CSV library handles reading, writing, and parsing CSV data. The library supports various options for delimiters, quoting, and encoding:

require 'csv'

# Writing CSV
CSV.open("output.csv", "w") do |csv|
  csv << ["Name", "Age", "City"]
  csv << ["Alice", 30, "Boston"]
  csv << ["Bob", 25, "Seattle"]
end

# Reading CSV with headers
data = CSV.read("output.csv", headers: true)
data.each do |row|
  puts "#{row['Name']} is #{row['Age']} years old"
end

# Parsing CSV string
csv_string = "name,age\nAlice,30\nBob,25"
parsed = CSV.parse(csv_string, headers: true, header_converters: :symbol)
parsed.first[:name]  # "Alice"
parsed.first[:age]   # "30" (still a string)

The CSV library provides converters for automatic type conversion:

require 'csv'

# Define custom converter (converters receive String fields only -
# CSV stops applying converters once a field becomes a non-String)
CSV::Converters[:blank_to_nil] = ->(value) {
  value && value.empty? ? nil : value
}

data = CSV.parse(<<~CSV, headers: true, converters: [:numeric, :blank_to_nil])
  name,age,score
  Alice,30,95.5
  Bob,"",""
  Carol,25,88.0
CSV

data[0]["age"]   # 30 (Integer)
data[0]["score"] # 95.5 (Float)
data[1]["age"]   # nil (quoted empty string converted to nil)

CSV options control parsing behavior for edge cases:

require 'csv'

# Custom delimiter and quote character
data = CSV.parse("name|age\nAlice|30", col_sep: "|", headers: true)

# Liberal parsing accepts illegally quoted fields
malformed = %Q(name,note\nAlice,has "stray quotes inside\n)
CSV.parse(malformed, headers: true, liberal_parsing: true)

# Skip blank lines
csv_with_blanks = "name,age\n\nAlice,30\n\nBob,25"
CSV.parse(csv_with_blanks, headers: true, skip_blanks: true)

JSON Serialization

Ruby's JSON library handles encoding and decoding JSON data. The library integrates with Ruby's object system through to_json and from_json methods:

require 'json'

# Encoding Ruby objects to JSON
data = {
  users: [
    { name: "Alice", age: 30, active: true },
    { name: "Bob", age: 25, active: false }
  ]
}

json_string = JSON.generate(data)
# or using to_json
json_string = data.to_json

# Pretty printing
puts JSON.pretty_generate(data)
# {
#   "users": [
#     {
#       "name": "Alice",
#       "age": 30,
#       "active": true
#     },
#     ...
#   ]
# }

JSON parsing converts JSON strings to Ruby data structures:

require 'json'

json_string = '{"name":"Alice","scores":[95,88,92],"metadata":null}'

# Parse JSON
data = JSON.parse(json_string)
data["name"]              # "Alice"
data["scores"]            # [95, 88, 92]
data["metadata"]          # nil

# Parse with symbol keys
data = JSON.parse(json_string, symbolize_names: true)
data[:name]               # "Alice"

# JSON.load parses the entire file in one pass - it is not a
# streaming parser; incremental parsing needs a gem such as yajl-ruby
File.open("large.json") do |file|
  JSON.load(file)
end
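For large inputs, newline-delimited JSON (one document per line, "JSON Lines") enables constant-memory processing with the standard library alone — a common workaround, assuming the producer emits one record per line:

```ruby
require 'json'
require 'stringio'

# Process newline-delimited JSON one record at a time instead of
# loading one giant document into memory
def each_json_line(io)
  io.each_line do |line|
    line = line.strip
    yield JSON.parse(line) unless line.empty?
  end
end

input = StringIO.new(<<~NDJSON)
  {"name":"Alice","age":30}
  {"name":"Bob","age":25}
NDJSON

names = []
each_json_line(input) { |record| names << record["name"] }
names  # ["Alice", "Bob"]
```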

Custom JSON serialization for Ruby objects:

require 'json'
require 'time'  # for Time.parse in from_json

class User
  attr_accessor :name, :age, :created_at
  
  def initialize(name, age)
    @name = name
    @age = age
    @created_at = Time.now
  end
  
  def to_json(*args)
    {
      name: @name,
      age: @age,
      created_at: @created_at.iso8601
    }.to_json(*args)
  end
  
  def self.from_json(json_string)
    data = JSON.parse(json_string)
    user = new(data["name"], data["age"])
    user.created_at = Time.parse(data["created_at"])
    user
  end
end

user = User.new("Alice", 30)
json = user.to_json  # Custom serialization
restored = User.from_json(json)

Parquet Integration

Ruby requires the red-parquet gem for Parquet file support. Red-parquet builds on Apache Arrow for efficient columnar data handling:

require 'parquet'

# Writing Parquet files
table = Arrow::Table.new(
  name: ["Alice", "Bob", "Carol"],
  age: [30, 25, 35],
  salary: [75000.0, 68000.0, 82000.0]
)

table.save("employees.parquet", format: :parquet)

# Reading Parquet files
table = Arrow::Table.load("employees.parquet", format: :parquet)

# Access columns
table["name"].to_a      # ["Alice", "Bob", "Carol"]
table["age"].to_a       # [30, 25, 35]

# Filter rows with the slicer block API
filtered = table.slice { |row| row["age"] > 25 }

Parquet schema definition and type mapping:

require 'parquet'
require 'bigdecimal'

# Define schema with specific types
schema = Arrow::Schema.new([
  Arrow::Field.new("id", :int64),
  Arrow::Field.new("name", :string),
  Arrow::Field.new("amount", Arrow::Decimal128DataType.new(10, 2)),
  Arrow::Field.new("timestamp", Arrow::TimestampDataType.new(:milli))
])

# Build a table from row-shaped records against the schema
# (Arrow::Table.new accepts a schema plus an array of rows)
records = [
  [1, "Alice", BigDecimal("99.99"), Time.now],
  [2, "Bob", BigDecimal("150.50"), Time.now - 3600],
  [3, "Carol", BigDecimal("200.00"), Time.now + 3600]
]

table = Arrow::Table.new(schema, records)
table.save("transactions.parquet", format: :parquet)

Parquet reading with column projection and filtering:

require 'parquet'

# Read specific columns only (column projection; the :columns option
# depends on the red-parquet version in use)
table = Arrow::Table.load(
  "employees.parquet",
  format: :parquet,
  columns: ["name", "salary"]  # Only load these columns
)

# Row group filtering (predicate pushdown)
# Parquet stores data in row groups - filtering at row group level
# avoids reading unnecessary data
table = Arrow::Table.load("employees.parquet", format: :parquet)
high_earners = table.slice { |row| row["salary"] > 70000 }

Avro Processing

Ruby uses the avro gem for Avro serialization. Avro requires schema definition before encoding or decoding data:

require 'avro'

# Define Avro schema
schema_json = {
  type: "record",
  name: "User",
  fields: [
    { name: "name", type: "string" },
    { name: "age", type: "int" },
    { name: "email", type: ["null", "string"], default: nil }
  ]
}.to_json

schema = Avro::Schema.parse(schema_json)

# Write Avro data
file = File.open("users.avro", "wb")
writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(schema), schema)

writer << { "name" => "Alice", "age" => 30, "email" => "alice@example.com" }
writer << { "name" => "Bob", "age" => 25, "email" => nil }

writer.close

# Read Avro data
file = File.open("users.avro", "rb")
reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)

reader.each do |record|
  puts "#{record['name']}: #{record['age']}"
end

reader.close

Avro schema evolution with compatible changes:

require 'avro'

# Writer schema (v1)
writer_schema = Avro::Schema.parse({
  type: "record",
  name: "User",
  fields: [
    { name: "name", type: "string" },
    { name: "age", type: "int" }
  ]
}.to_json)

# Reader schema (v2) - adds optional field
reader_schema = Avro::Schema.parse({
  type: "record",
  name: "User",
  fields: [
    { name: "name", type: "string" },
    { name: "age", type: "int" },
    { name: "city", type: "string", default: "Unknown" }
  ]
}.to_json)

# Write with v1 schema
file = File.open("users_v1.avro", "wb")
writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(writer_schema), writer_schema)
writer << { "name" => "Alice", "age" => 30 }
writer.close

# Read with v2 schema - default value applied
file = File.open("users_v1.avro", "rb")
reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new(writer_schema, reader_schema))
record = reader.first
record["city"]  # "Unknown" (default value)
reader.close

Avro encoding with complex types:

require 'avro'

# Complex schema with nested types
schema = Avro::Schema.parse({
  type: "record",
  name: "Order",
  fields: [
    { name: "order_id", type: "long" },
    { name: "items", type: {
      type: "array",
      items: {
        type: "record",
        name: "Item",
        fields: [
          { name: "product", type: "string" },
          { name: "quantity", type: "int" },
          { name: "price", type: "double" }
        ]
      }
    }},
    { name: "metadata", type: {
      type: "map",
      values: "string"
    }}
  ]
}.to_json)

# Write complex data
file = File.open("orders.avro", "wb")
writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(schema), schema)

writer << {
  "order_id" => 12345,
  "items" => [
    { "product" => "Widget", "quantity" => 2, "price" => 19.99 },
    { "product" => "Gadget", "quantity" => 1, "price" => 49.99 }
  ],
  "metadata" => { "source" => "web", "campaign" => "spring2025" }
}

writer.close

Design Considerations

Format selection depends on data characteristics, access patterns, system constraints, and interoperability requirements. Each format addresses different priorities in the data management spectrum.

When to Use CSV

CSV works for simple tabular data requiring human readability and universal compatibility. Spreadsheet tools, database import/export, and data exchange between heterogeneous systems favor CSV due to minimal tooling requirements.

Use CSV when:

# Simple tabular export for business users
require 'csv'

def export_sales_report(sales_data)
  CSV.generate(headers: true) do |csv|
    csv << ["Date", "Product", "Quantity", "Revenue"]
    sales_data.each do |sale|
      csv << [sale.date, sale.product, sale.quantity, sale.revenue]
    end
  end
end

# Easy inspection and manual editing
# Users can open in Excel, Google Sheets, or text editor

Avoid CSV when:

  • Data contains nested structures (addresses, line items, hierarchies)
  • Precise type information matters (dates, decimals, nulls versus empty strings)
  • File size exceeds several hundred megabytes (parsing becomes slow)
  • Schema changes frequently (column additions break parsers)
  • Data contains many special characters requiring escaping
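The nulls-versus-empty-strings ambiguity is concrete in Ruby's parser — an unquoted empty field becomes nil while a quoted empty field stays a String:

```ruby
require 'csv'

# Ruby's CSV distinguishes unquoted empty fields (nil) from
# quoted empty fields ("") - consumers must handle both
row = CSV.parse_line(%Q(Alice,,""))

row[1]  # nil - unquoted empty
row[2]  # "" - quoted empty
```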

When to Use JSON

JSON handles hierarchical data structures with moderate nesting and heterogeneous schemas. REST APIs, configuration files, and document stores rely on JSON for flexible data representation without strict schema requirements.

Use JSON when:

require 'json'

# API response with nested data
api_response = {
  user: {
    id: 12345,
    name: "Alice Smith",
    preferences: {
      theme: "dark",
      notifications: ["email", "sms"]
    }
  },
  orders: [
    {
      id: 1001,
      items: [
        { product: "Widget", qty: 2 },
        { product: "Gadget", qty: 1 }
      ],
      total: 89.97
    }
  ]
}.to_json

# Configuration file with nested sections
config = {
  database: {
    host: "localhost",
    port: 5432,
    pool: { min: 5, max: 20 }
  },
  cache: {
    enabled: true,
    backend: "redis",
    ttl: 3600
  }
}
File.write("config.json", JSON.pretty_generate(config))

Avoid JSON when:

  • Data volume exceeds hundreds of megabytes (parsing becomes memory-intensive)
  • Analytics queries need to scan specific columns (row-oriented format inefficient)
  • Binary data dominates (base64 encoding inflates size)
  • Schema enforcement prevents invalid data (JSON lacks type validation)
  • Numeric precision matters (IEEE 754 double precision limitations)
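The precision point is directly observable: JSON numbers parse to Float by default, and Ruby's json gem (2.1+) can substitute BigDecimal via the decimal_class option when exact decimal arithmetic matters:

```ruby
require 'json'
require 'bigdecimal'

# Floats accumulate binary rounding error; parsing into BigDecimal
# keeps decimal values exact (decimal_class requires json gem >= 2.1)
as_float = JSON.parse('{"amount": 0.1}')["amount"]
as_float + as_float + as_float == 0.3  # false - binary rounding error

as_decimal = JSON.parse('{"amount": 0.1}', decimal_class: BigDecimal)["amount"]
as_decimal + as_decimal + as_decimal == BigDecimal("0.3")  # true - exact
```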

When to Use Parquet

Parquet optimizes for analytical queries over large datasets with repeated column access. Data warehouses, analytics platforms, and batch processing pipelines benefit from columnar storage and efficient compression.

Use Parquet when:

require 'parquet'

# Analytics query reading specific columns
def analyze_sales(file_path)
  # Parquet reads only needed columns from disk
  table = Arrow::Table.load(
    file_path,
    format: :parquet,
    columns: ["product_id", "revenue", "timestamp"]  # Skip other columns
  )
  
  # Aggregate revenue by product (red-arrow's group API;
  # available aggregations vary by version)
  table.group("product_id").sum("revenue")
end

# Large dataset with high compression
# 10GB CSV might compress to 1GB Parquet with dictionary encoding
# and delta compression on sorted columns

Avoid Parquet when:

  • Row-based access dominates (retrieving full records repeatedly)
  • Real-time updates required (Parquet files immutable after writing)
  • Human inspection needed (binary format requires specialized tools)
  • Small datasets under 10MB (overhead not justified)
  • Stream processing with record-by-record operations

When to Use Avro

Avro supports schema evolution in streaming systems and data pipelines with version compatibility requirements. Kafka message serialization, data lake ingestion, and RPC systems use Avro for schema management across service versions.

Use Avro when:

require 'avro'
require 'stringio'

# Message serialization with schema evolution
class EventProducer
  def initialize(schema_registry)
    @schema_registry = schema_registry
    # load_latest_schema is an illustrative registry helper returning
    # the parsed schema plus its registry-assigned id
    @schema, @schema_id = load_latest_schema("user_event")
  end
  
  def produce_event(event_data)
    # Schema versioning ensures compatibility
    writer = Avro::IO::DatumWriter.new(@schema)
    buffer = StringIO.new
    encoder = Avro::IO::BinaryEncoder.new(buffer)
    writer.write(event_data, encoder)
    
    {
      schema_id: @schema_id,  # assigned by the schema registry (illustrative)
      payload: buffer.string
    }
  end
end

# Schema evolution with backward/forward compatibility
# Old readers can process new data (backward compatible)
# New readers can process old data (forward compatible)

Avoid Avro when:

  • Schema changes infrequently or never (overhead not justified)
  • Human readability matters for debugging (binary format opaque)
  • Ad-hoc queries dominate (columnar format better)
  • Minimal tooling requirements (CSV/JSON simpler)
  • Small message sizes where overhead matters (protobuf more compact)

Hybrid Approaches

Real systems often combine formats based on data lifecycle stages:

# Ingestion: CSV/JSON for compatibility
# → Transform to Parquet for analytics storage
# → Avro for streaming between services
# → JSON for API responses

class DataPipeline
  def process_upload(csv_file)
    # Read CSV input
    data = CSV.read(csv_file, headers: true)
    
    # Transform and validate
    records = data.map { |row| transform_row(row) }
    
    # Write to Parquet for analytics
    table = build_arrow_table(records)
    table.save("analytics.parquet", format: :parquet)
    
    # Stream events to Kafka in Avro
    records.each { |record| publish_avro_event(record) }
    
    # Return JSON response
    { processed: records.size, status: "complete" }.to_json
  end
end

Practical Examples

Real-world usage demonstrates format characteristics, performance trade-offs, and integration patterns across different scenarios.

CSV Data Processing Pipeline

Processing large CSV files requires streaming to avoid memory exhaustion:

require 'csv'

# Stream large CSV file without loading into memory
def process_large_csv(input_file, output_file)
  errors = []
  processed = 0
  
  CSV.open(output_file, "w", headers: true) do |output|
    CSV.foreach(input_file, headers: true).with_index do |row, idx|
      begin
        # Validate and transform
        validated = validate_row(row, idx)
        output << validated if validated
        processed += 1
      rescue ValidationError => e
        errors << { line: idx + 1, error: e.message }
      end
      
      # Progress reporting for large files
      puts "Processed #{processed} records" if (processed % 10000).zero?
    end
  end
  
  { processed: processed, errors: errors }
end

def validate_row(row, line_number)
  # Type conversion and validation
  age = Integer(row["age"]) rescue raise ValidationError, "Invalid age"
  email = row["email"]
  raise ValidationError, "Missing email" if email.nil? || email.empty?
  
  [row["name"], age, email.downcase]
rescue ArgumentError => e
  raise ValidationError, "Line #{line_number}: #{e.message}"
end

class ValidationError < StandardError; end

CSV aggregation with grouping and statistics:

require 'csv'

def analyze_sales_csv(file_path)
  sales_by_product = Hash.new { |h, k| h[k] = { qty: 0, revenue: 0.0 } }
  
  CSV.foreach(file_path, headers: true, converters: :numeric) do |row|
    product = row["product"]
    sales_by_product[product][:qty] += row["quantity"]
    sales_by_product[product][:revenue] += row["price"] * row["quantity"]
  end
  
  # Calculate averages and format output
  results = sales_by_product.map do |product, data|
    {
      product: product,
      total_quantity: data[:qty],
      total_revenue: data[:revenue],
      avg_price: data[:revenue] / data[:qty]
    }
  end
  
  # Sort by revenue descending
  results.sort_by { |r| -r[:total_revenue] }
end

JSON API Integration

Building and consuming JSON APIs with proper error handling:

require 'json'
require 'net/http'
require 'uri'

class APIClient
  def initialize(base_url, api_key)
    @base_url = base_url
    @api_key = api_key
  end
  
  def get_user(user_id)
    uri = URI("#{@base_url}/users/#{user_id}")
    request = Net::HTTP::Get.new(uri)
    request["Authorization"] = "Bearer #{@api_key}"
    request["Content-Type"] = "application/json"
    
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
      http.request(request)
    end
    
    handle_response(response)
  end
  
  def create_user(user_data)
    uri = URI("#{@base_url}/users")
    request = Net::HTTP::Post.new(uri)
    request["Authorization"] = "Bearer #{@api_key}"
    request["Content-Type"] = "application/json"
    request.body = user_data.to_json
    
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
      http.request(request)
    end
    
    handle_response(response)
  end
  
  private
  
  def handle_response(response)
    case response.code.to_i
    when 200..299
      JSON.parse(response.body, symbolize_names: true)
    when 400..499
      error_data = JSON.parse(response.body) rescue {}
      raise ClientError, "#{response.code}: #{error_data['message']}"
    when 500..599
      raise ServerError, "Server error: #{response.code}"
    else
      raise APIError, "Unexpected response: #{response.code}"
    end
  rescue JSON::ParserError => e
    raise APIError, "Invalid JSON response: #{e.message}"
  end
end

class APIError < StandardError; end
class ClientError < APIError; end
class ServerError < APIError; end

# Usage
client = APIClient.new("https://api.example.com", "secret_key")
user = client.get_user(12345)
puts user[:name]

Nested JSON transformation and flattening:

require 'json'

def flatten_json(nested_hash, parent_key = nil, result = {})
  nested_hash.each do |key, value|
    new_key = parent_key ? "#{parent_key}.#{key}" : key.to_s
    
    if value.is_a?(Hash)
      flatten_json(value, new_key, result)
    elsif value.is_a?(Array)
      value.each_with_index do |item, idx|
        if item.is_a?(Hash) || item.is_a?(Array)
          flatten_json(item, "#{new_key}[#{idx}]", result)
        else
          result["#{new_key}[#{idx}]"] = item
        end
      end
    else
      result[new_key] = value
    end
  end
  
  result
end

# Example usage
nested = {
  user: {
    id: 123,
    profile: {
      name: "Alice",
      contacts: [
        { type: "email", value: "alice@example.com" },
        { type: "phone", value: "555-0100" }
      ]
    }
  }
}

flattened = flatten_json(nested)
# {
#   "user.id" => 123,
#   "user.profile.name" => "Alice",
#   "user.profile.contacts[0].type" => "email",
#   "user.profile.contacts[0].value" => "alice@example.com",
#   "user.profile.contacts[1].type" => "phone",
#   "user.profile.contacts[1].value" => "555-0100"
# }

Parquet Analytics Workflow

Building analytics tables from multiple sources:

require 'parquet'

class AnalyticsBuilder
  def build_user_analytics(user_events_path, user_profiles_path, output_path)
    # Load event data
    events = Arrow::Table.load(user_events_path, format: :parquet)
    profiles = Arrow::Table.load(user_profiles_path, format: :parquet)
    
    # Join tables on user_id
    joined = join_tables(events, profiles, "user_id")
    
    # Aggregate metrics
    analytics = joined.group_by("user_id", "signup_date").aggregate(
      "event_count" => ["count"],
      "page_views" => ["sum"],
      "session_duration" => ["sum", "avg"]
    )
    
    # Add computed columns
    enriched = add_computed_columns(analytics)
    
    # Write output with compression (compression option support
    # varies by red-parquet version)
    enriched.save(
      output_path,
      format: :parquet,
      compression: :snappy
    )
  end
  
  private
  
  def join_tables(left, right, key)
    # Simplified join logic
    left_data = table_to_hash(left, key)
    right_data = table_to_hash(right, key)
    
    joined_rows = left_data.map do |k, left_row|
      right_row = right_data[k] || {}
      left_row.merge(right_row)
    end
    
    hash_to_table(joined_rows)
  end
  
  def add_computed_columns(table)
    # Add engagement score based on metrics
    rows = table.each_record.map do |row|
      score = calculate_engagement_score(
        row["event_count"],
        row["page_views"],
        row["session_duration_avg"]
      )
      row.to_h.merge("engagement_score" => score)
    end
    
    hash_to_table(rows)
  end
  
  def calculate_engagement_score(events, views, avg_duration)
    (events * 0.3 + views * 0.5 + avg_duration / 60 * 0.2).round(2)
  end
end

Parquet partitioning for efficient queries:

require 'parquet'
require 'fileutils'  # for mkdir_p when creating partition directories

class PartitionedWriter
  def write_partitioned_data(data, base_path, partition_keys)
    # Group data by partition keys
    partitions = Hash.new { |h, k| h[k] = [] }
    
    data.each do |row|
      partition_values = partition_keys.map { |key| row[key] }
      partitions[partition_values] << row
    end
    
    # Write each partition to separate file
    partitions.each do |partition_values, rows|
      partition_path = build_partition_path(base_path, partition_keys, partition_values)
      FileUtils.mkdir_p(File.dirname(partition_path))
      
      table = build_table(rows)
      table.save(partition_path, format: :parquet)
    end
  end
  
  def build_partition_path(base, keys, values)
    partition_dirs = keys.zip(values).map { |k, v| "#{k}=#{v}" }
    File.join(base, *partition_dirs, "data.parquet")
  end
end

# Usage: partition by year and month
writer = PartitionedWriter.new
writer.write_partitioned_data(
  events,
  "/data/events",
  ["year", "month"]
)
# Creates structure:
# /data/events/year=2025/month=01/data.parquet
# /data/events/year=2025/month=02/data.parquet

Avro Event Streaming

Producing and consuming Avro events with schema evolution:

require 'avro'

class EventStream
  def initialize(schema_path)
    @schema = Avro::Schema.parse(File.read(schema_path))
  end
  
  def produce_event(event_data, output_file)
    # An Avro data file carries a single header, so reuse one writer
    # per file rather than re-opening in append mode for every event
    file = File.open(output_file, "wb")
    writer = Avro::DataFile::Writer.new(
      file,
      Avro::IO::DatumWriter.new(@schema),
      @schema
    )
    
    # Validate event against schema
    validate_event(event_data)
    
    writer << event_data
    writer.close
  end
  
  def consume_events(input_file)
    file = File.open(input_file, "rb")
    reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
    
    reader.each do |event|
      yield event
    end
    
    reader.close
  end
  
  private
  
  def validate_event(event_data)
    @schema.fields.each do |field|
      value = event_data[field.name]

      # Non-union fields without a default cannot be nil
      # (union types respond to #types and may permit null)
      if field.default.nil? && value.nil? && !field.type.respond_to?(:types)
        raise ValidationError, "Required field #{field.name} missing"
      end
    end
  end
end

# Event processor with schema evolution
class EventProcessor
  def initialize(old_schema_path, new_schema_path)
    # Old schema: what the input was written with; new schema: what records
    # are resolved into and written back out
    @old_schema = Avro::Schema.parse(File.read(old_schema_path))
    @new_schema = Avro::Schema.parse(File.read(new_schema_path))
  end

  def process_file(input_file, output_file)
    input = File.open(input_file, "rb")
    output = File.open(output_file, "wb")

    # DatumReader takes the writer's schema first, then the reader's
    reader = Avro::DataFile::Reader.new(
      input,
      Avro::IO::DatumReader.new(@old_schema, @new_schema)
    )

    writer = Avro::DataFile::Writer.new(
      output,
      Avro::IO::DatumWriter.new(@new_schema),
      @new_schema
    )
    
    reader.each do |event|
      transformed = transform_event(event)
      writer << transformed
    end
    
    reader.close
    writer.close
  end
  
  private
  
  def transform_event(event)
    # Transform from old schema to new schema
    # Apply business logic, enrichment, etc.
    event.merge(
      "processed_at" => Time.now.to_i,
      "version" => 2
    )
  end
end
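The evolution above works only if the schemas are resolution-compatible: any field added in the new version must carry a default so records written under the old version still decode. A sketch with two hypothetical schema versions, checked with plain JSON parsing:

```ruby
require 'json'

# Hypothetical v1/v2 event schemas; v2 adds processed_at with a default
v1 = JSON.parse(<<~SCHEMA)
  {"type": "record", "name": "Event", "fields": [
    {"name": "id", "type": "long"},
    {"name": "event_type", "type": "string"}
  ]}
SCHEMA

v2 = JSON.parse(<<~SCHEMA)
  {"type": "record", "name": "Event", "fields": [
    {"name": "id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "processed_at", "type": ["null", "long"], "default": null}
  ]}
SCHEMA

# Fields added in v2 must declare defaults to stay backward compatible
added = v2["fields"] - v1["fields"]
compatible = added.all? { |f| f.key?("default") }
# compatible => true
```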

Performance Considerations

File format performance varies across dimensions: serialization speed, storage size, query efficiency, and memory usage. Understanding these trade-offs guides format selection for specific workloads.

Storage Size and Compression

CSV stores data in plain text with minimal compression opportunities. Each value serialized as human-readable text inflates storage:

# CSV storage characteristics
csv_data = <<~CSV
  user_id,timestamp,event_type,page_url
  123456,2025-01-15T10:30:00Z,page_view,https://example.com/products/widget-2000
  123456,2025-01-15T10:31:00Z,page_view,https://example.com/products/widget-3000
  123456,2025-01-15T10:32:00Z,click,https://example.com/products/widget-2000/buy
CSV

# Storage: ~230 bytes for 3 rows
# Integer IDs stored as text: "123456" = 6 bytes vs 4 bytes binary
# Repeated URLs fully written each time
# Column names repeated as headers

JSON adds structural overhead with syntax characters and key repetition:

json_data = [
  {
    user_id: 123456,
    timestamp: "2025-01-15T10:30:00Z",
    event_type: "page_view",
    page_url: "https://example.com/products/widget-2000"
  },
  {
    user_id: 123456,
    timestamp: "2025-01-15T10:31:00Z",
    event_type: "page_view",
    page_url: "https://example.com/products/widget-3000"
  }
].to_json

# Storage: ~380 bytes for 2 events
# Key names repeated for each object
# Brackets, braces, quotes add overhead
# Numbers stored as text in JSON

Parquet applies columnar compression and encoding:

# Parquet storage optimizations
# Dictionary encoding for repeated values:
#   event_type column: ["page_view", "click"]
#   Store: dictionary + indices
#   "page_view" appears 1000 times → stored once + 1000 indices
#
# Run-length encoding for sequential values:
#   user_id: [123456, 123456, 123456, 123457, 123457]
#   Store: [(123456, count=3), (123457, count=2)]
#
# Delta encoding for sorted values:
#   timestamps: [1000, 1001, 1002, 1003]
#   Store: 1000 + [1, 1, 1]
#
# Result: 10GB CSV → 1-2GB Parquet typical compression ratio

Avro binary encoding reduces size compared to text formats:

# Avro storage characteristics
# Compact binary encoding:
#   Integer 123456 → variable-length encoding (3 bytes)
#   String "page_view" → length + bytes
#   Schema stored once, not repeated per record
#
# Block compression:
#   Multiple records compressed together
#   Snappy/Deflate compression on blocks
#
# Result: Smaller than JSON, larger than Parquet
# Good balance of size and random access

Read Performance

CSV requires full file scanning for queries. Parsing converts text to objects:

require 'csv'
require 'benchmark'

# CSV full scan required for filtering
def find_user_events_csv(file_path, user_id)
  results = []
  CSV.foreach(file_path, headers: true) do |row|
    results << row if row["user_id"] == user_id.to_s
  end
  results
end

# Benchmark: CSV parsing overhead
Benchmark.measure do
  CSV.read("large_file.csv", headers: true, converters: :numeric)
end
# Time dominated by:
# - Text parsing (splitting, unescaping)
# - Type conversion (string → numeric)
# - Full file scan (no indexing)

JSON requires deserializing entire objects:

require 'json'

# JSON reading characteristics
def find_user_events_json(file_path, user_id)
  data = JSON.parse(File.read(file_path))
  data.select { |event| event["user_id"] == user_id }
end

# Performance issues:
# - Must parse entire file into memory
# - Object construction overhead
# - No predicate pushdown
# - No column projection

Parquet enables column pruning and predicate pushdown:

require 'parquet'

# Parquet columnar reading
def find_user_events_parquet(file_path, user_id, needed_columns)
  table = Arrow::Table.load(
    file_path,
    format: :parquet,
    columns: needed_columns  # Only read these columns from disk
  )

  # red-arrow row filtering via a slicer over the column
  table.slice { |slicer| slicer["user_id"] == user_id }
end

# Performance advantages:
# - Skip unused columns entirely (I/O savings)
# - Row group filtering via statistics
# - Efficient compression maintains good I/O
# - Vectorized processing in Arrow

Avro supports sequential streaming with schema projection:

require 'avro'

# Avro streaming read
def process_events_streaming(file_path)
  file = File.open(file_path, "rb")
  reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
  
  reader.each do |event|
    # Process one record at a time
    # Memory efficient for large files
    yield event
  end
  
  reader.close
end

# Performance characteristics:
# - Low memory footprint (streaming)
# - Fast binary deserialization
# - Schema reading enables projection
# - Block-level compression

Write Performance

CSV writing appends text with escaping:

require 'csv'
require 'time'       # Time#iso8601
require 'benchmark'

# CSV write performance
Benchmark.measure do
  CSV.open("output.csv", "w") do |csv|
    10_000.times do |i|
      csv << [i, "user_#{i}", Time.now.iso8601, rand(100)]
    end
  end
end

# Bottlenecks:
# - String formatting for each value
# - Quote escaping checks
# - Newline handling
# - No batch optimization

JSON serialization constructs syntax:

require 'json'
require 'time'  # Time#iso8601

# JSON write performance
Benchmark.measure do
  data = 10_000.times.map do |i|
    {
      id: i,
      name: "user_#{i}",
      timestamp: Time.now.iso8601,
      score: rand(100)
    }
  end
  
  File.write("output.json", JSON.generate(data))
end

# Overhead:
# - Object traversal
# - Syntax character insertion
# - String escaping
# - Full serialization before write

Parquet batched writes with compression:

require 'parquet'

# Parquet write performance
Benchmark.measure do
  rows = 10_000.times.map do |i|
    [i, "user_#{i}", Time.now, rand(100)]
  end
  
  table = Arrow::Table.new(
    id: rows.map(&:first),
    name: rows.map { |r| r[1] },
    timestamp: rows.map { |r| r[2] },
    score: rows.map { |r| r[3] }
  )
  
  Arrow::Table.save(table, "output.parquet", format: :parquet, compression: :snappy)
end

# Optimizations:
# - Columnar layout enables vectorization
# - Batch encoding (dictionary, RLE)
# - Parallel column compression
# - Efficient binary encoding

Avro buffered writing:

require 'avro'

# Avro write with buffered blocks
schema = Avro::Schema.parse(File.read("schema.avsc"))
datum_writer = Avro::IO::DatumWriter.new(schema)

file = File.open("output.avro", "wb")
writer = Avro::DataFile::Writer.new(file, datum_writer, schema)

10_000.times do |i|
  writer << { "id" => i, "name" => "user_#{i}", "score" => rand(100) }

  # Flush the buffered block every 1000 records
  writer.flush if (i % 1000).zero?
end

writer.close

# Performance features:
# - Buffered writes reduce I/O
# - Block compression
# - Configurable sync markers
# - Fast binary encoding

Memory Usage

CSV streaming enables low memory processing:

require 'csv'

# Memory-efficient CSV processing
def process_large_csv(file_path)
  CSV.foreach(file_path, headers: true) do |row|
    # Process one row at a time
    # Memory usage constant regardless of file size
    process_row(row)
  end
end

# Memory: O(1) per row, independent of file size

JSON requires loading full document:

require 'json'

# JSON memory usage
def process_json(file_path)
  data = JSON.parse(File.read(file_path))  # Load entire file
  data.each { |item| process_item(item) }
end

# Memory: O(n) where n is data size
# Risk: OutOfMemory for large files
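A common workaround is newline-delimited JSON (JSON Lines), which restores per-record streaming with nothing beyond the standard library:

```ruby
require 'json'
require 'tempfile'

# One JSON document per line: parse each line independently,
# keeping memory O(1) per record
def each_json_line(io)
  io.each_line { |line| yield JSON.parse(line) }
end

ids = Tempfile.create("events") do |f|
  f.puts({ "user_id" => 1 }.to_json)
  f.puts({ "user_id" => 2 }.to_json)
  f.rewind

  collected = []
  each_json_line(f) { |event| collected << event["user_id"] }
  collected
end

ids  # => [1, 2]
```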

Parquet with Arrow uses memory mapping:

require 'parquet'

# Parquet memory-mapped reading
def process_parquet_columns(file_path, columns)
  table = Arrow::Table.load(file_path, format: :parquet, columns: columns)

  # Arrow uses memory mapping when possible
  # Data accessed on-demand from disk
  table.each_record { |record| process_row(record) }
end

# Memory: Depends on column selection and row group size
# More efficient than loading all data

Avro supports streaming iteration:

require 'avro'

# Avro streaming reduces memory
def process_avro_stream(file_path)
  file = File.open(file_path, "rb")
  reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
  
  reader.each do |record|
    process_record(record)
  end
  
  reader.close
end

# Memory: O(1) per record + compression buffer

Tools & Ecosystem

Each file format has supporting libraries, command-line tools, and integration points within the Ruby ecosystem and broader data infrastructure.

CSV Tools

Ruby standard library provides comprehensive CSV support:

require 'csv'

# CSV module features
CSV::Converters  # Built-in type converters
CSV::HeaderConverters  # Header transformation

# Custom converter registration
CSV::Converters[:money] = lambda { |value|
  value =~ /\A\$[\d,.]+\z/ ? value.delete("$,").to_f : value
}

# The field contains a comma, so it must be quoted
CSV.parse('"$1,234.56"', converters: [:money])
# => [[1234.56]]

External gems extend CSV functionality:

# smarter_csv - enhanced parsing with chunking
require 'smarter_csv'

options = {
  chunk_size: 1000,
  key_mapping: { 'User Name' => :user_name },
  remove_empty_values: true
}

SmarterCSV.process('large.csv', options) do |chunk|
  chunk.each { |row| process_row(row) }
end

# roo - reading Excel, OpenOffice, CSV
require 'roo'

xlsx = Roo::Spreadsheet.open('data.xlsx')
csv = Roo::CSV.new('data.csv')
csv.each_row { |row| puts row.inspect }

Command-line tools for CSV manipulation:

# csvkit - command-line CSV toolkit
csvstat data.csv              # Column statistics
csvcut -c name,age data.csv   # Select columns
csvgrep -c age -m 30 data.csv # Filter rows
csvsql --query "SELECT * FROM data WHERE age > 30" data.csv

# xsv - fast CSV toolkit in Rust
xsv select name,age data.csv  # Column selection
xsv stats data.csv            # Statistics
xsv join id data1.csv id data2.csv  # Join files

JSON Tools

Ruby JSON library and extensions:

require 'json'

# Fast JSON parsing with Oj gem
require 'oj'

# Oj (Optimized JSON) - faster than standard library
data = Oj.load(json_string)
json = Oj.dump(data)

# Oj modes
Oj.load(json_string, mode: :strict)    # Strict JSON
Oj.load(json_string, mode: :compat)    # Compatible with JSON gem
Oj.load(json_string, mode: :rails)     # Rails-specific

# MultiJson - abstraction over JSON libraries
require 'multi_json'

MultiJson.load(json_string)
MultiJson.dump(data)
MultiJson.use(:oj)  # Select the Oj backend

JSON Schema validation:

require 'json-schema'

schema = {
  "type" => "object",
  "required" => ["name", "age"],
  "properties" => {
    "name" => { "type" => "string" },
    "age" => { "type" => "integer", "minimum" => 0 }
  }
}

data = { "name" => "Alice", "age" => 30 }
JSON::Validator.validate!(schema, data)  # Raises if invalid

Command-line JSON tools:

# jq - JSON processor
cat data.json | jq '.users[] | select(.age > 30)'
cat data.json | jq '[.users[].name]'
cat data.json | jq '.users | group_by(.city)'

# fx - terminal JSON viewer
fx data.json

# json2csv - convert JSON to CSV
json2csv -i data.json -o data.csv -k name,age,city

Parquet Tools

Ruby Parquet support through red-parquet gem:

require 'parquet'

# red-parquet builds on Apache Arrow
table = Arrow::Table.load("data.parquet", format: :parquet)

# Access Arrow functionality
table.n_rows
table.n_columns
table.schema
table.columns

# Table operations (red-arrow)
filtered = table.slice { |slicer| slicer["age"] > 30 }  # boolean filter
sorted = table.sort_by("age")                            # sort by column

Parquet metadata inspection:

require 'parquet'

# Inspect schema and shape by loading the table
table = Arrow::Table.load("data.parquet", format: :parquet)

# Schema information
table.schema.fields.each do |field|
  puts "#{field.name}: #{field.data_type}"
end

# Table statistics
table.n_rows
table.n_columns

# Row-group level metadata (row counts, column statistics) is exposed by
# the lower-level Parquet GLib bindings rather than Arrow::Table itself

Command-line Parquet tools:

# parquet-tools - Java-based toolkit
parquet-tools schema data.parquet    # Show schema
parquet-tools meta data.parquet      # Show metadata
parquet-tools cat data.parquet       # Show data
parquet-tools head -n 10 data.parquet

# parquet-cli - command-line interface
parquet schema data.parquet
parquet cat --columns name,age data.parquet
parquet to-avro data.parquet output.avro

# DuckDB - query Parquet files with SQL
duckdb -c "SELECT * FROM 'data.parquet' WHERE age > 30"

Avro Tools

Ruby Avro gem:

require 'avro'

# Schema parsing and validation
schema_json = File.read("schema.avsc")
schema = Avro::Schema.parse(schema_json)

# Validate data against schema
valid = Avro::Schema.validate(schema, data)

# Schema fingerprinting (CRC-64-AVRO, an instance method on the schema)
fingerprint = schema.crc_64_avro_fingerprint

Avro schema registry integration:

require 'avro'
require 'net/http'

class SchemaRegistry
  def initialize(registry_url)
    @registry_url = registry_url
  end
  
  def register_schema(subject, schema)
    uri = URI("#{@registry_url}/subjects/#{subject}/versions")
    request = Net::HTTP::Post.new(uri)
    request["Content-Type"] = "application/vnd.schemaregistry.v1+json"
    request.body = { schema: schema.to_s }.to_json
    
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    
    JSON.parse(response.body)["id"]
  end
  
  def get_schema(subject, version = "latest")
    uri = URI("#{@registry_url}/subjects/#{subject}/versions/#{version}")
    response = Net::HTTP.get_response(uri)
    schema_json = JSON.parse(response.body)["schema"]
    Avro::Schema.parse(schema_json)
  end
end

registry = SchemaRegistry.new("http://localhost:8081")
schema_id = registry.register_schema("user-events", schema)
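When registry-managed Avro messages travel through Kafka, producers conventionally prepend Confluent's wire format: a zero magic byte, the 4-byte big-endian schema ID, then the raw Avro payload. A sketch (frame_message/unframe_message are hypothetical helpers):

```ruby
# Confluent wire format: 1-byte magic (0x00) + 4-byte big-endian schema ID
def frame_message(schema_id, avro_payload)
  [0, schema_id].pack("CN") + avro_payload
end

def unframe_message(framed)
  magic, schema_id = framed.unpack("CN")
  raise "unknown magic byte #{magic}" unless magic.zero?
  [schema_id, framed.byteslice(5, framed.bytesize - 5)]
end

schema_id, payload = unframe_message(frame_message(42, "raw-avro-bytes"))
# schema_id => 42, payload => "raw-avro-bytes"
```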

Command-line Avro tools:

# avro-tools - Java-based toolkit
avro-tools getschema data.avro      # Extract schema
avro-tools tojson data.avro         # Convert to JSON
avro-tools fromjson --schema-file schema.avsc data.json

# avro-cli - Go-based toolkit
avro-cli schema data.avro           # Show schema
avro-cli cat data.avro              # Show data
avro-cli to-parquet data.avro output.parquet

Integration Libraries

File format conversion utilities:

# Convert CSV to Parquet
require 'csv'
require 'parquet'

def csv_to_parquet(csv_file, parquet_file)
  rows = CSV.read(csv_file, headers: true, converters: :numeric)
  
  # Build Arrow table from CSV
  columns = {}
  rows.headers.each do |header|
    columns[header] = rows[header]
  end
  
  table = Arrow::Table.new(columns)
  Arrow::Table.save(table, parquet_file, format: :parquet)
end

# Convert JSON to Avro
require 'json'
require 'avro'

def json_to_avro(json_file, avro_file, schema)
  data = JSON.parse(File.read(json_file))
  
  file = File.open(avro_file, "wb")
  writer = Avro::DataFile::Writer.new(
    file,
    Avro::IO::DatumWriter.new(schema),
    schema
  )
  
  data.each { |record| writer << record }
  writer.close
end

Database integration:

# PostgreSQL COPY for CSV (server-side COPY: the path must be readable by the database server)
require 'pg'

conn = PG.connect(dbname: 'mydb')
conn.exec("COPY users FROM '/path/to/users.csv' WITH CSV HEADER")

# Export to CSV
conn.copy_data("COPY (SELECT * FROM users) TO STDOUT WITH CSV HEADER") do
  File.open("export.csv", "w") do |f|
    while row = conn.get_copy_data
      f.write(row)
    end
  end
end

# JSON columns in PostgreSQL (exec_params binds $1 safely)
conn.exec_params("INSERT INTO events (data) VALUES ($1)", [data.to_json])
result = conn.exec_params("SELECT data FROM events WHERE data->>'user_id' = $1", ["123"])

Reference

Format Comparison Matrix

| Characteristic | CSV | JSON | Parquet | Avro |
|---|---|---|---|---|
| Storage Model | Row-based text | Row-based text | Columnar binary | Row-based binary |
| Human Readable | Yes | Yes | No | No |
| Schema Required | No | No | Yes | Yes |
| Nested Data | No | Yes | Yes | Yes |
| Compression Efficiency | Low | Low | High | Medium |
| Read Performance | Slow | Slow | Fast for column queries | Fast for full rows |
| Write Performance | Fast | Fast | Slower | Fast |
| Query Optimization | None | None | Column pruning, predicate pushdown | Limited |
| Schema Evolution | None | Flexible | Limited | Full support |
| Streaming Support | Yes | Limited | No | Yes |
| Size for 1M Rows | ~100MB | ~150MB | ~10-20MB | ~30-40MB |
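The text-format rows of the matrix are easy to sanity-check: the same records rendered as CSV and as JSON differ mainly in per-record key repetition. Byte counts depend on the data, so treat this as illustrative:

```ruby
require 'csv'
require 'json'

# 1,000 identical records serialized both ways
rows = 1_000.times.map { |i| { "id" => i, "name" => "user_#{i}", "score" => i % 100 } }

csv_size = CSV.generate do |csv|
  csv << rows.first.keys              # header written once
  rows.each { |r| csv << r.values }
end.bytesize

json_size = JSON.generate(rows).bytesize

csv_size < json_size  # => true: JSON repeats keys in every object
```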

CSV Processing Reference

| Operation | Method | Notes |
|---|---|---|
| Read CSV | CSV.read(path, options) | Loads entire file into memory |
| Stream CSV | CSV.foreach(path, options) | Memory-efficient row-by-row |
| Parse String | CSV.parse(string, options) | Parse CSV from string |
| Generate CSV | CSV.generate(options) | Build CSV string |
| Write CSV | CSV.open(path, "w") | Write CSV file |
| Headers | headers: true | Parse with headers |
| Type Conversion | converters: :numeric | Convert types during parse |
| Custom Delimiter | col_sep: "\t" | Use tab or other separator |
| Quote Character | quote_char: "'" | Change quote character |
| Skip Blanks | skip_blanks: true | Ignore empty lines |
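These options compose in a single call; for example, tab-separated input with blank lines and numeric conversion:

```ruby
require 'csv'

tsv = "name\tage\n\nAlice\t30\nBob\t25\n"

# Tab delimiter, blank-line skipping, and numeric conversion together
rows = CSV.parse(tsv, headers: true, col_sep: "\t",
                 skip_blanks: true, converters: :numeric)

pairs = rows.map { |r| [r["name"], r["age"]] }
# => [["Alice", 30], ["Bob", 25]]
```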

JSON Processing Reference

| Operation | Method | Notes |
|---|---|---|
| Parse JSON | JSON.parse(string) | Parse JSON string |
| Generate JSON | JSON.generate(object) | Ruby to JSON |
| Pretty Print | JSON.pretty_generate(object) | Formatted output |
| Symbol Keys | symbolize_names: true | Use symbols instead of strings |
| Custom Serialization | define to_json | Custom object encoding |
| Load from File | JSON.load(file) | Parse file directly |
| Streaming Parser | JSON::Stream::Parser | Memory-efficient parsing (json-stream gem) |
| Schema Validation | JSON::Validator.validate! | Validate against schema (json-schema gem) |

Parquet Operations Reference

| Operation | Code Pattern | Purpose |
|---|---|---|
| Load Table | Arrow::Table.load(path, format: :parquet) | Read Parquet file |
| Save Table | Arrow::Table.save(table, path, format: :parquet) | Write Parquet file |
| Column Projection | columns: ["col1", "col2"] | Read specific columns only |
| Compression | compression: :snappy | Set compression algorithm |
| Row Count | table.n_rows | Number of rows |
| Column Count | table.n_columns | Number of columns |
| Schema | table.schema | Table schema |
| Filter Rows | table.slice { \|slicer\| condition } | Row filtering |
| Group By | table.group(key) | Aggregation operations |

Avro Operations Reference

| Operation | Code Pattern | Purpose |
|---|---|---|
| Parse Schema | Avro::Schema.parse(json) | Load schema from JSON |
| Create Writer | Avro::DataFile::Writer.new | Initialize file writer |
| Write Record | writer << record | Add record to file |
| Create Reader | Avro::DataFile::Reader.new | Initialize file reader |
| Read Records | reader.each { \|r\| ... } | Iterate records |
| Schema Evolution | DatumReader.new(writer_schema, reader_schema) | Handle schema changes |
| Validate Data | Avro::Schema.validate(schema, data) | Check data validity |
| Schema Fingerprint | schema.crc_64_avro_fingerprint | Schema hash |

Type Mapping Reference

| Ruby Type | CSV | JSON | Parquet | Avro |
|---|---|---|---|---|
| Integer | "123" | 123 | INT32/INT64 | int, long |
| Float | "99.99" | 99.99 | FLOAT/DOUBLE | float, double |
| String | text | "text" | STRING | string |
| Boolean | "true" | true | BOOLEAN | boolean |
| Date | "2025-01-15" | "2025-01-15" | DATE | int (days from epoch) |
| Timestamp | "2025-01-15T10:30:00Z" | "2025-01-15T10:30:00Z" | TIMESTAMP | long (milliseconds) |
| Decimal | "123.45" | 123.45 | DECIMAL | bytes (logical type) |
| Array | "val1,val2" | [val1, val2] | LIST | array |
| Object | JSON string | {key: val} | STRUCT | record |
| Null | "" | null | NULL | null (union type) |
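A quick round trip illustrates the CSV column of the table: every value degrades to text unless converters intervene, while JSON preserves numbers and booleans:

```ruby
require 'csv'
require 'json'

row = [123, 99.99, true]

csv_back  = CSV.parse_line(row.to_csv)                        # all strings
typed     = CSV.parse_line(row.to_csv, converters: :numeric)  # numbers restored, boolean still a string
json_back = JSON.parse(JSON.generate(row))                    # types preserved
```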

Compression Algorithms

| Algorithm | CSV | JSON | Parquet | Avro | Speed | Ratio | Use Case |
|---|---|---|---|---|---|---|---|
| None | N/A | N/A | Default | Default | Fastest | 1x | Testing only |
| Gzip | External | External | Supported | Supported | Slow | 3-5x | Maximum compression |
| Snappy | External | External | Recommended | Supported | Fast | 2-3x | Balanced performance |
| Zstd | External | External | Supported | Supported | Balanced | 3-4x | Modern compression |
| LZ4 | External | External | Supported | N/A | Very Fast | 2x | High throughput |

Performance Characteristics

| Scenario | Recommended Format | Rationale |
|---|---|---|
| Small datasets under 10MB | CSV or JSON | Simplicity outweighs efficiency |
| Analytics queries | Parquet | Column pruning, compression |
| Full record access | Avro or CSV | Row-based reading efficient |
| Human inspection needed | CSV or JSON | Text formats readable |
| Schema changes frequent | JSON or Avro | Flexible or managed evolution |
| Storage cost critical | Parquet | Best compression ratios |
| Streaming required | CSV or Avro | Sequential read support |
| Cross-language exchange | JSON | Universal parser support |
| Type safety required | Parquet or Avro | Schema enforcement |
| Fast writes | CSV or JSON | Minimal encoding overhead |