Overview
File formats define how data gets structured, serialized, and stored for persistence or transmission between systems. CSV, JSON, Parquet, and Avro represent different approaches to data serialization, each optimized for specific use cases and constraints.
CSV (Comma-Separated Values) stores tabular data in plain text with values separated by delimiters. Each line represents a row, with commas (or other characters) separating fields. CSV originated in early database and spreadsheet applications, becoming ubiquitous for data exchange due to its simplicity and human readability.
JSON (JavaScript Object Notation) serializes hierarchical data structures using human-readable text. Based on JavaScript object syntax, JSON represents data as key-value pairs, arrays, and nested objects. JSON became the dominant format for web APIs and configuration files due to its flexibility and language-independent parsing.
Parquet stores columnar data in a compressed, binary format optimized for analytical queries. Developed within the Apache Hadoop ecosystem, Parquet organizes data by columns rather than rows, enabling efficient compression and selective column reading for analytics workloads.
Avro serializes structured data in a compact binary format with schema evolution support. Created by Apache as part of the Hadoop ecosystem, Avro embeds schema information with data, enabling compatibility checks and schema migration across system versions.
These formats differ fundamentally in their data models, serialization approaches, and performance characteristics:
# CSV - simple tabular data
csv_data = <<~CSV
name,age,city
Alice,30,Boston
Bob,25,Seattle
CSV
# JSON - hierarchical structures
json_data = {
users: [
{ name: "Alice", age: 30, address: { city: "Boston" } },
{ name: "Bob", age: 25, address: { city: "Seattle" } }
]
}.to_json
# Parquet - columnar binary (typically handled by specialized libraries)
# Data stored in column chunks with compression and encoding
# Avro - row-based binary with schema
# Schema definition separate from data encoding
Format selection impacts system performance, storage costs, query patterns, and integration complexity. Text-based formats (CSV, JSON) prioritize human readability and interoperability at the cost of storage efficiency. Binary formats (Parquet, Avro) optimize for performance and schema management but require specialized tools for inspection.
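The storage cost of that readability is easy to measure. A small illustrative comparison using only the standard library, serializing the same three rows both ways:

```ruby
require 'csv'
require 'json'

rows = [
  { "name" => "Alice", "age" => 30, "city" => "Boston" },
  { "name" => "Bob",   "age" => 25, "city" => "Seattle" },
  { "name" => "Carol", "age" => 35, "city" => "Boston" }
]

# CSV: keys appear once, in the header row
csv = CSV.generate do |out|
  out << rows.first.keys
  rows.each { |r| out << r.values }
end

# JSON: keys repeated in every object, plus braces, brackets, and quotes
json = rows.to_json

csv.bytesize < json.bytesize # JSON carries more structural overhead
```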
Key Principles
Data serialization formats balance competing requirements: human readability versus machine efficiency, flexibility versus type safety, storage size versus query performance. Understanding these trade-offs guides format selection for specific use cases.
Schema Enforcement
CSV lacks formal schema definition. Column types must be inferred during parsing, leading to ambiguity with numeric strings, dates, and null values. Without converters, Ruby's parser returns every field as a String (or nil):
require 'csv'
data = "id,value\n1,100\n2,abc\n3,\n"
parsed = CSV.parse(data, headers: true)
parsed[0]["value"] # "100" (String, not Integer)
parsed[2]["value"] # nil (Ruby's CSV parses unquoted empty fields as nil; only a quoted "" yields an empty string)
JSON provides type information for basic types (string, number, boolean, null, array, object) but lacks date, decimal, or custom type support. Type validation requires external schema languages like JSON Schema:
require 'json'
data = { timestamp: "2025-01-15", amount: 99.99 }
json_str = data.to_json # {"timestamp":"2025-01-15","amount":99.99}
parsed = JSON.parse(json_str)
parsed["timestamp"] # String - date information lost
parsed["amount"] # Float - subject to IEEE 754 rounding, not an exact decimal
Parquet embeds schema in file metadata, enforcing column types at write time. The schema defines logical types (date, decimal, timestamp) that map to physical storage types:
# Parquet schema definition (conceptual)
schema = {
fields: [
{ name: "id", type: "INT64", required: true },
{ name: "amount", type: "DECIMAL(10,2)", required: true },
{ name: "timestamp", type: "TIMESTAMP_MILLIS", required: false }
]
}
Avro requires explicit schema definition before serialization. The schema travels with the data or gets stored in a schema registry, enabling version compatibility checks:
# Avro schema definition
schema = {
type: "record",
name: "Transaction",
fields: [
{ name: "id", type: "long" },
{ name: "amount", type: { type: "bytes", logicalType: "decimal", precision: 10, scale: 2 } },
{ name: "timestamp", type: ["null", "long"], default: nil }
]
}
Data Model and Nesting
CSV represents flat, tabular data. Each row contains the same fields in the same order. Nested structures require flattening or encoding as strings:
# Nested data must be flattened
address = { street: "123 Main", city: "Boston" }
csv_row = ["Alice", "123 Main", "Boston"] # Flatten structure
# Or encode as JSON string
csv_row = ["Alice", '{"street":"123 Main","city":"Boston"}'] # Nested as string
JSON natively supports nested objects and arrays. Arbitrary nesting depth enables hierarchical data representation:
data = {
user: {
name: "Alice",
addresses: [
{ type: "home", city: "Boston" },
{ type: "work", city: "Cambridge" }
]
}
}
Parquet supports nested structures through repetition and definition levels. Nested columns get encoded as separate column chunks with metadata tracking nesting:
# Parquet nested schema
schema = {
fields: [
{ name: "user.name", type: "STRING" },
{ name: "user.addresses", type: "LIST", element: {
type: "STRUCT", fields: [
{ name: "type", type: "STRING" },
{ name: "city", type: "STRING" }
]
}}
]
}
Avro supports complex types (records, arrays, maps, unions) through schema composition. Nested schemas define hierarchical structures with type safety:
# Avro nested schema
schema = {
type: "record",
name: "User",
fields: [
{ name: "name", type: "string" },
{ name: "addresses", type: {
type: "array",
items: {
type: "record",
name: "Address",
fields: [
{ name: "type", type: "string" },
{ name: "city", type: "string" }
]
}
}}
]
}
Serialization Approach
CSV serializes row-by-row with delimiters separating values. Special characters require escaping through quoting:
require 'csv'
rows = [
["Name", "Description"],
["Product A", "Contains, comma"],
["Product B", 'Contains "quotes"']
]
csv_string = CSV.generate do |csv|
rows.each { |row| csv << row }
end
# Output:
# Name,Description
# Product A,"Contains, comma"
# Product B,"Contains ""quotes"""
JSON serializes to text with syntax rules for objects, arrays, strings, and primitives. Serialization transforms Ruby objects to JSON text representation:
require 'json'
data = {
products: [
{ name: "Product A", price: 10.50 },
{ name: "Product B", price: 20.00 }
]
}
json = JSON.generate(data)
# {"products":[{"name":"Product A","price":10.5},{"name":"Product B","price":20.0}]}
Parquet serializes data column-by-column, applying encoding and compression to each column independently. Values for a single column get stored contiguously:
# Conceptual column-wise storage
column_data = {
"name" => ["Alice", "Bob", "Carol"], # Stored together
"age" => [30, 25, 35], # Stored together
"city" => ["Boston", "Seattle", "Boston"] # Stored together, compressed
}
Avro serializes data row-by-row in compact binary format. The schema defines field order and encoding rules:
# Avro binary encoding (conceptual)
# Schema: { name: string, age: int }
# Data: { name: "Alice", age: 30 }
# Binary: [0x0A, "Alice", 0x3C] # zigzag varints: 0x0A encodes length 5, 0x3C encodes 30
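In real Avro, lengths and integers are written as zigzag-encoded varints. A minimal sketch of that encoding from the Avro specification (plain Ruby, not the avro gem's internals):

```ruby
# Zigzag maps signed integers to unsigned so small magnitudes stay small:
# 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
def zigzag(n)
  (n << 1) ^ (n >> 63)
end

# Varint: 7 payload bits per byte, high bit set on all but the last byte
def encode_long(n)
  z = zigzag(n)
  bytes = []
  while z >= 0x80
    bytes << ((z & 0x7F) | 0x80)
    z >>= 7
  end
  bytes << z
end

encode_long(5)              # => [10] - the single byte before "Alice"
encode_long(30)             # => [60]
encode_long(123_456).length # => 3 bytes for a six-digit integer
```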
Schema Evolution
CSV lacks schema evolution mechanisms. Adding fields requires manual coordination and column position assumptions:
# Original CSV
original = "name,age\nAlice,30\n"
# Adding field requires coordination
evolved = "name,age,city\nAlice,30,Boston\n" # New field added
legacy = "name,age\nBob,25\n" # Old format incompatible
JSON handles field additions gracefully through object structure. Parsers ignore unknown fields, and missing fields default to nil:
require 'json'
old_format = { name: "Alice", age: 30 }.to_json
new_format = { name: "Bob", age: 25, city: "Seattle" }.to_json
# Parser handles both
JSON.parse(old_format) # Missing city field
JSON.parse(new_format) # Additional city field
Parquet supports schema evolution through column projection. Readers can request specific columns, ignoring new or missing fields:
# Write with schema v1: [name, age]
# Write with schema v2: [name, age, city]
# Reader can request [name, age] from both files
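The projection idea can be sketched in plain Ruby, with a Hash of column arrays standing in for Parquet column chunks (a conceptual model, not the actual file layout):

```ruby
# Two "files" written under different schema versions
file_v1 = { "name" => ["Alice"], "age" => [30] }
file_v2 = { "name" => ["Bob"], "age" => [25], "city" => ["Seattle"] }

# A projecting reader touches only the requested columns;
# extra columns in newer files are simply never read
def project(columns, requested)
  requested.map { |col| [col, columns.fetch(col)] }.to_h
end

project(file_v1, ["name", "age"]) # => {"name"=>["Alice"], "age"=>[30]}
project(file_v2, ["name", "age"]) # same shape despite the extra column
```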
Avro provides explicit schema evolution rules. Reader and writer schemas can differ if compatible according to resolution rules:
# Writer schema v1
writer_schema = {
type: "record",
name: "User",
fields: [
{ name: "name", type: "string" },
{ name: "age", type: "int" }
]
}
# Reader schema v2 (compatible evolution)
reader_schema = {
type: "record",
name: "User",
fields: [
{ name: "name", type: "string" },
{ name: "age", type: "int" },
{ name: "city", type: "string", default: "Unknown" } # New field with default
]
}
Ruby Implementation
Ruby provides built-in support for CSV and JSON through standard library modules. Binary formats require external gems with native extensions for performance.
CSV Processing
Ruby's CSV library handles reading, writing, and parsing CSV data. The library supports various options for delimiters, quoting, and encoding:
require 'csv'
# Writing CSV
CSV.open("output.csv", "w") do |csv|
csv << ["Name", "Age", "City"]
csv << ["Alice", 30, "Boston"]
csv << ["Bob", 25, "Seattle"]
end
# Reading CSV with headers
data = CSV.read("output.csv", headers: true)
data.each do |row|
puts "#{row['Name']} is #{row['Age']} years old"
end
# Parsing CSV string
csv_string = "name,age\nAlice,30\nBob,25"
parsed = CSV.parse(csv_string, headers: true, header_converters: :symbol)
parsed.first[:name] # "Alice"
parsed.first[:age] # "30" (still a string)
The CSV library provides converters for automatic type conversion:
require 'csv'
# Define custom converter
CSV::Converters[:blank_to_nil] = ->(value) {
# Guard with respond_to? - an earlier converter (e.g. :numeric) may
# already have produced an Integer or Float, which has no #empty?
value.respond_to?(:empty?) && value.empty? ? nil : value
}
data = CSV.parse(<<~CSV, headers: true, converters: [:numeric, :blank_to_nil])
name,age,score
Alice,30,95.5
Bob,,
Carol,25,88.0
CSV
data[0]["age"] # 30 (Integer)
data[0]["score"] # 95.5 (Float)
data[1]["age"] # nil (empty field)
CSV options control parsing behavior for edge cases:
require 'csv'
# Custom delimiter and quote character
data = CSV.parse("name|age\nAlice|30", col_sep: "|", headers: true)
# Liberal parsing tolerates stray quotes inside unquoted fields,
# which would otherwise raise CSV::MalformedCSVError
malformed = "name,note\nAlice,not \"really\" quoted"
CSV.parse(malformed, headers: true, liberal_parsing: true)
# Skip blank lines
csv_with_blanks = "name,age\n\nAlice,30\n\nBob,25"
CSV.parse(csv_with_blanks, headers: true, skip_blanks: true)
JSON Serialization
Ruby's JSON library handles encoding and decoding JSON data. The library integrates with Ruby's object system through to_json and JSON.parse:
require 'json'
# Encoding Ruby objects to JSON
data = {
users: [
{ name: "Alice", age: 30, active: true },
{ name: "Bob", age: 25, active: false }
]
}
json_string = JSON.generate(data)
# or using to_json
json_string = data.to_json
# Pretty printing
puts JSON.pretty_generate(data)
# {
# "users": [
# {
# "name": "Alice",
# "age": 30,
# "active": true
# }
# ]
# }
JSON parsing converts JSON strings to Ruby data structures:
require 'json'
json_string = '{"name":"Alice","scores":[95,88,92],"metadata":null}'
# Parse JSON
data = JSON.parse(json_string)
data["name"] # "Alice"
data["scores"] # [95, 88, 92]
data["metadata"] # nil
# Parse with symbol keys
data = JSON.parse(json_string, symbolize_names: true)
data[:name] # "Alice"
# Note: JSON.load reads the whole file into memory - the standard
# library has no incremental parser, so very large documents need a
# streaming gem (e.g. yajl-ruby) or line-delimited JSON
File.open("large.json") do |file|
JSON.load(file)
end
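For genuinely large inputs, newline-delimited JSON (NDJSON) sidesteps the whole-document parse: each line is an independent document, so memory stays constant. A sketch using only the standard library:

```ruby
require 'json'

# One JSON document per line (NDJSON / JSON Lines)
ndjson = <<~DATA
  {"name":"Alice","age":30}
  {"name":"Bob","age":25}
DATA

names = []
ndjson.each_line do |line|
  record = JSON.parse(line) # parse one small document at a time
  names << record["name"]
end
names # => ["Alice", "Bob"]
```

With a real file, File.foreach gives the same line-at-a-time behavior.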
Custom JSON serialization for Ruby objects:
require 'json'
require 'time' # Time.parse and Time#iso8601
class User
attr_accessor :name, :age, :created_at
def initialize(name, age)
@name = name
@age = age
@created_at = Time.now
end
def to_json(*args)
{
name: @name,
age: @age,
created_at: @created_at.iso8601
}.to_json(*args)
end
def self.from_json(json_string)
data = JSON.parse(json_string)
user = new(data["name"], data["age"])
user.created_at = Time.parse(data["created_at"])
user
end
end
user = User.new("Alice", 30)
json = user.to_json # Custom serialization
restored = User.from_json(json)
Parquet Integration
Ruby requires the red-parquet gem for Parquet file support. Red-parquet builds on Apache Arrow for efficient columnar data handling:
require 'parquet'
# Writing Parquet files
table = Arrow::Table.new(
name: ["Alice", "Bob", "Carol"],
age: [30, 25, 35],
salary: [75000.0, 68000.0, 82000.0]
)
table.save("employees.parquet") # instance method; format inferred from the .parquet extension
# Reading Parquet files
table = Arrow::Table.load("employees.parquet", format: :parquet)
# Access columns
table.column("name").to_a # ["Alice", "Bob", "Carol"]
table.column("age").to_a # [30, 25, 35]
# Filter rows (red-arrow's slicer API; exact form varies by version)
filtered = table.slice { |slicer| slicer.age > 25 }
Parquet schema definition and type mapping:
require 'parquet'
require 'bigdecimal'
# Define schema with specific types
schema = Arrow::Schema.new([
Arrow::Field.new("id", :int64),
Arrow::Field.new("name", :string),
Arrow::Field.new("amount", Arrow::Decimal128DataType.new(10, 2)),
Arrow::Field.new("timestamp", Arrow::TimestampDataType.new(:milli))
])
# Build table with schema (sketch - red-arrow builder APIs differ
# between versions; consult the gem's documentation for yours)
builder = Arrow::RecordBatchBuilder.new(schema)
builder.append([
{ "id" => 1, "name" => "Alice", "amount" => BigDecimal("99.99"), "timestamp" => Time.now },
{ "id" => 2, "name" => "Bob", "amount" => BigDecimal("150.50"), "timestamp" => Time.now - 3600 },
{ "id" => 3, "name" => "Carol", "amount" => BigDecimal("200.00"), "timestamp" => Time.now + 3600 }
])
table = Arrow::Table.new(schema, [builder.flush])
table.save("transactions.parquet")
Parquet reading with column projection and filtering:
require 'parquet'
# Read specific columns only
table = Arrow::Table.load(
"employees.parquet",
format: :parquet,
columns: ["name", "salary"] # Only load these columns
)
# Row group filtering (predicate pushdown)
# Parquet stores data in row groups - filtering at row group level
# avoids reading unnecessary data
table = Arrow::Table.load("employees.parquet", format: :parquet)
high_earners = table.slice { |slicer| slicer.salary > 70000 }
Avro Processing
Ruby uses the avro gem for Avro serialization. Avro requires schema definition before encoding or decoding data:
require 'avro'
# Define Avro schema
schema_json = {
type: "record",
name: "User",
fields: [
{ name: "name", type: "string" },
{ name: "age", type: "int" },
{ name: "email", type: ["null", "string"], default: nil }
]
}.to_json
schema = Avro::Schema.parse(schema_json)
# Write Avro data
file = File.open("users.avro", "wb")
writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(schema), schema)
writer << { "name" => "Alice", "age" => 30, "email" => "alice@example.com" }
writer << { "name" => "Bob", "age" => 25, "email" => nil }
writer.close
# Read Avro data
file = File.open("users.avro", "rb")
reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
reader.each do |record|
puts "#{record['name']}: #{record['age']}"
end
reader.close
Avro schema evolution with compatible changes:
require 'avro'
# Writer schema (v1)
writer_schema = Avro::Schema.parse({
type: "record",
name: "User",
fields: [
{ name: "name", type: "string" },
{ name: "age", type: "int" }
]
}.to_json)
# Reader schema (v2) - adds optional field
reader_schema = Avro::Schema.parse({
type: "record",
name: "User",
fields: [
{ name: "name", type: "string" },
{ name: "age", type: "int" },
{ name: "city", type: "string", default: "Unknown" }
]
}.to_json)
# Write with v1 schema
file = File.open("users_v1.avro", "wb")
writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(writer_schema), writer_schema)
writer << { "name" => "Alice", "age" => 30 }
writer.close
# Read with v2 schema - default value applied
file = File.open("users_v1.avro", "rb")
reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new(writer_schema, reader_schema))
record = reader.first
record["city"] # "Unknown" (default value)
reader.close
Avro encoding with complex types:
require 'avro'
# Complex schema with nested types
schema = Avro::Schema.parse({
type: "record",
name: "Order",
fields: [
{ name: "order_id", type: "long" },
{ name: "items", type: {
type: "array",
items: {
type: "record",
name: "Item",
fields: [
{ name: "product", type: "string" },
{ name: "quantity", type: "int" },
{ name: "price", type: "double" }
]
}
}},
{ name: "metadata", type: {
type: "map",
values: "string"
}}
]
}.to_json)
# Write complex data
file = File.open("orders.avro", "wb")
writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(schema), schema)
writer << {
"order_id" => 12345,
"items" => [
{ "product" => "Widget", "quantity" => 2, "price" => 19.99 },
{ "product" => "Gadget", "quantity" => 1, "price" => 49.99 }
],
"metadata" => { "source" => "web", "campaign" => "spring2025" }
}
writer.close
Design Considerations
Format selection depends on data characteristics, access patterns, system constraints, and interoperability requirements. Each format addresses different priorities in the data management spectrum.
When to Use CSV
CSV works for simple tabular data requiring human readability and universal compatibility. Spreadsheet tools, database import/export, and data exchange between heterogeneous systems favor CSV due to minimal tooling requirements.
Use CSV when:
# Simple tabular export for business users
require 'csv'
def export_sales_report(sales_data)
CSV.generate(headers: true) do |csv|
csv << ["Date", "Product", "Quantity", "Revenue"]
sales_data.each do |sale|
csv << [sale.date, sale.product, sale.quantity, sale.revenue]
end
end
end
# Easy inspection and manual editing
# Users can open in Excel, Google Sheets, or text editor
Avoid CSV when:
- Data contains nested structures (addresses, line items, hierarchies)
- Precise type information matters (dates, decimals, nulls versus empty strings)
- File size exceeds several hundred megabytes (parsing becomes slow)
- Schema changes frequently (column additions break parsers)
- Data contains many special characters requiring escaping
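The type-information point is easy to demonstrate: every value survives a CSV round trip only as a String:

```ruby
require 'csv'
require 'date'

line = CSV.generate_line(["Alice", 30, Date.new(2025, 1, 15)])
# => "Alice,30,2025-01-15\n"

row = CSV.parse_line(line)
row[1]       # "30" - the Integer is gone
row[2].class # String - the Date is gone
```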
When to Use JSON
JSON handles hierarchical data structures with moderate nesting and heterogeneous schemas. REST APIs, configuration files, and document stores rely on JSON for flexible data representation without strict schema requirements.
Use JSON when:
require 'json'
# API response with nested data
api_response = {
user: {
id: 12345,
name: "Alice Smith",
preferences: {
theme: "dark",
notifications: ["email", "sms"]
}
},
orders: [
{
id: 1001,
items: [
{ product: "Widget", qty: 2 },
{ product: "Gadget", qty: 1 }
],
total: 89.97
}
]
}.to_json
# Configuration file with nested sections
config = {
database: {
host: "localhost",
port: 5432,
pool: { min: 5, max: 20 }
},
cache: {
enabled: true,
backend: "redis",
ttl: 3600
}
}
File.write("config.json", JSON.pretty_generate(config))
Avoid JSON when:
- Data volume exceeds hundreds of megabytes (parsing becomes memory-intensive)
- Analytics queries need to scan specific columns (row-oriented format inefficient)
- Binary data dominates (base64 encoding inflates size)
- Schema enforcement is required to reject invalid data (plain JSON lacks built-in validation)
- Numeric precision matters (IEEE 754 double precision limitations)
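The precision caveat is concrete: JSON numbers deserialize to Floats, so decimal arithmetic picks up IEEE 754 rounding. Transmitting decimals as strings is a common workaround:

```ruby
require 'json'
require 'bigdecimal'

parsed = JSON.parse('{"a": 0.1, "b": 0.2}')
sum = parsed["a"] + parsed["b"]
sum == 0.3 # false - sum is 0.30000000000000004

# Workaround: carry decimals as strings and convert explicitly
exact = BigDecimal(JSON.parse('{"amount": "99.99"}')["amount"])
exact == BigDecimal("99.99") # true - no rounding
```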
When to Use Parquet
Parquet optimizes for analytical queries over large datasets with repeated column access. Data warehouses, analytics platforms, and batch processing pipelines benefit from columnar storage and efficient compression.
Use Parquet when:
require 'parquet'
# Analytics query reading specific columns
def analyze_sales(file_path)
# Parquet reads only needed columns from disk
table = Arrow::Table.load(
file_path,
format: :parquet,
columns: ["product_id", "revenue", "timestamp"] # Skip other columns
)
# Aggregate by product (red-arrow exposes grouping via Table#group;
# exact aggregate calls vary by version - shown conceptually)
table.group("product_id").sum("revenue")
end
# Large dataset with high compression
# 10GB CSV might compress to 1GB Parquet with dictionary encoding
# and delta compression on sorted columns
Avoid Parquet when:
- Row-based access dominates (retrieving full records repeatedly)
- Real-time updates required (Parquet files immutable after writing)
- Human inspection needed (binary format requires specialized tools)
- Small datasets under 10MB (overhead not justified)
- Stream processing with record-by-record operations
When to Use Avro
Avro supports schema evolution in streaming systems and data pipelines with version compatibility requirements. Kafka message serialization, data lake ingestion, and RPC systems use Avro for schema management across service versions.
Use Avro when:
require 'avro'
# Message serialization with schema evolution
class EventProducer
def initialize(schema_registry)
@schema_registry = schema_registry
@schema = load_latest_schema("user_event")
end
def produce_event(event_data)
# Schema versioning ensures compatibility
writer = Avro::IO::DatumWriter.new(@schema)
buffer = StringIO.new
encoder = Avro::IO::BinaryEncoder.new(buffer)
writer.write(event_data, encoder)
{
schema_id: @schema_registry.id_for(@schema), # hypothetical registry lookup (Avro::Schema itself has no version)
payload: buffer.string
}
end
end
# Schema evolution with backward/forward compatibility
# Old readers can process new data (backward compatible)
# New readers can process old data (forward compatible)
Avoid Avro when:
- Schema changes infrequently or never (overhead not justified)
- Human readability matters for debugging (binary format opaque)
- Ad-hoc queries dominate (columnar format better)
- Minimal tooling requirements (CSV/JSON simpler)
- Small message sizes where overhead matters (protobuf more compact)
Hybrid Approaches
Real systems often combine formats based on data lifecycle stages:
# Ingestion: CSV/JSON for compatibility
# → Transform to Parquet for analytics storage
# → Avro for streaming between services
# → JSON for API responses
class DataPipeline
def process_upload(csv_file)
# Read CSV input
data = CSV.read(csv_file, headers: true)
# Transform and validate
records = data.map { |row| transform_row(row) }
# Write to Parquet for analytics
table = build_arrow_table(records)
table.save("analytics.parquet")
# Stream events to Kafka in Avro
records.each { |record| publish_avro_event(record) }
# Return JSON response
{ processed: records.size, status: "complete" }.to_json
end
end
Practical Examples
Real-world usage demonstrates format characteristics, performance trade-offs, and integration patterns across different scenarios.
CSV Data Processing Pipeline
Processing large CSV files requires streaming to avoid memory exhaustion:
require 'csv'
# Stream large CSV file without loading into memory
def process_large_csv(input_file, output_file)
errors = []
processed = 0
CSV.open(output_file, "w", headers: true) do |output|
CSV.foreach(input_file, headers: true).with_index do |row, idx|
begin
# Validate and transform
validated = validate_row(row, idx)
output << validated if validated
processed += 1
rescue ValidationError => e
errors << { line: idx + 1, error: e.message }
end
# Progress reporting for large files
puts "Processed #{processed} records" if (processed % 10000).zero?
end
end
{ processed: processed, errors: errors }
end
def validate_row(row, line_number)
# Type conversion and validation
age = Integer(row["age"])
email = row["email"]
raise ValidationError, "Missing email" if email.nil? || email.empty?
[row["name"], age, email.downcase]
rescue ArgumentError, TypeError => e
# Integer() raises ArgumentError for "abc" and TypeError for nil
raise ValidationError, "Line #{line_number}: invalid age (#{e.message})"
end
class ValidationError < StandardError; end
CSV aggregation with grouping and statistics:
require 'csv'
def analyze_sales_csv(file_path)
sales_by_product = Hash.new { |h, k| h[k] = { qty: 0, revenue: 0.0 } }
CSV.foreach(file_path, headers: true, converters: :numeric) do |row|
product = row["product"]
sales_by_product[product][:qty] += row["quantity"]
sales_by_product[product][:revenue] += row["price"] * row["quantity"]
end
# Calculate averages and format output
results = sales_by_product.map do |product, data|
{
product: product,
total_quantity: data[:qty],
total_revenue: data[:revenue],
avg_price: data[:revenue] / data[:qty]
}
end
# Sort by revenue descending
results.sort_by { |r| -r[:total_revenue] }
end
JSON API Integration
Building and consuming JSON APIs with proper error handling:
require 'json'
require 'net/http'
require 'uri'
class APIClient
def initialize(base_url, api_key)
@base_url = base_url
@api_key = api_key
end
def get_user(user_id)
uri = URI("#{@base_url}/users/#{user_id}")
request = Net::HTTP::Get.new(uri)
request["Authorization"] = "Bearer #{@api_key}"
request["Content-Type"] = "application/json"
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
http.request(request)
end
handle_response(response)
end
def create_user(user_data)
uri = URI("#{@base_url}/users")
request = Net::HTTP::Post.new(uri)
request["Authorization"] = "Bearer #{@api_key}"
request["Content-Type"] = "application/json"
request.body = user_data.to_json
response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
http.request(request)
end
handle_response(response)
end
private
def handle_response(response)
case response.code.to_i
when 200..299
JSON.parse(response.body, symbolize_names: true)
when 400..499
error_data = JSON.parse(response.body) rescue {}
raise ClientError, "#{response.code}: #{error_data['message']}"
when 500..599
raise ServerError, "Server error: #{response.code}"
else
raise APIError, "Unexpected response: #{response.code}"
end
rescue JSON::ParserError => e
raise APIError, "Invalid JSON response: #{e.message}"
end
end
class APIError < StandardError; end
class ClientError < APIError; end
class ServerError < APIError; end
# Usage
client = APIClient.new("https://api.example.com", "secret_key")
user = client.get_user(12345)
puts user[:name]
Nested JSON transformation and flattening:
require 'json'
def flatten_json(nested_hash, parent_key = nil, result = {})
nested_hash.each do |key, value|
new_key = parent_key ? "#{parent_key}.#{key}" : key.to_s
if value.is_a?(Hash)
flatten_json(value, new_key, result)
elsif value.is_a?(Array)
value.each_with_index do |item, idx|
if item.is_a?(Hash) || item.is_a?(Array)
flatten_json(item, "#{new_key}[#{idx}]", result)
else
result["#{new_key}[#{idx}]"] = item
end
end
else
result[new_key] = value
end
end
result
end
# Example usage
nested = {
user: {
id: 123,
profile: {
name: "Alice",
contacts: [
{ type: "email", value: "alice@example.com" },
{ type: "phone", value: "555-0100" }
]
}
}
}
flattened = flatten_json(nested)
# {
# "user.id" => 123,
# "user.profile.name" => "Alice",
# "user.profile.contacts[0].type" => "email",
# "user.profile.contacts[0].value" => "alice@example.com",
# "user.profile.contacts[1].type" => "phone",
# "user.profile.contacts[1].value" => "555-0100"
# }
Parquet Analytics Workflow
Building analytics tables from multiple sources:
require 'parquet'
class AnalyticsBuilder
def build_user_analytics(user_events_path, user_profiles_path, output_path)
# Load event data
events = Arrow::Table.load(user_events_path, format: :parquet)
profiles = Arrow::Table.load(user_profiles_path, format: :parquet)
# Join tables on user_id
joined = join_tables(events, profiles, "user_id")
# Aggregate metrics (conceptual - check red-arrow's grouping API
# for the exact aggregate calls in your version)
analytics = joined.group("user_id", "signup_date").aggregate(
"event_count" => ["count"],
"page_views" => ["sum"],
"session_duration" => ["sum", "avg"]
)
# Add computed columns
enriched = add_computed_columns(analytics)
# Write output with compression
# Write output with compression (option name may vary by version)
enriched.save(output_path, format: :parquet, compression: :snappy)
end
private
def join_tables(left, right, key)
# Simplified join logic
left_data = table_to_hash(left, key)
right_data = table_to_hash(right, key)
joined_rows = left_data.map do |k, left_row|
right_row = right_data[k] || {}
left_row.merge(right_row)
end
hash_to_table(joined_rows)
end
def add_computed_columns(table)
# Add engagement score based on metrics
rows = table.each_record.map do |row|
score = calculate_engagement_score(
row["event_count"],
row["page_views"],
row["session_duration_avg"]
)
row.to_h.merge("engagement_score" => score)
end
hash_to_table(rows)
end
def calculate_engagement_score(events, views, avg_duration)
(events * 0.3 + views * 0.5 + avg_duration / 60 * 0.2).round(2)
end
end
Parquet partitioning for efficient queries:
require 'parquet'
require 'fileutils'
class PartitionedWriter
def write_partitioned_data(data, base_path, partition_keys)
# Group data by partition keys
partitions = Hash.new { |h, k| h[k] = [] }
data.each do |row|
partition_values = partition_keys.map { |key| row[key] }
partitions[partition_values] << row
end
# Write each partition to separate file
partitions.each do |partition_values, rows|
partition_path = build_partition_path(base_path, partition_keys, partition_values)
FileUtils.mkdir_p(File.dirname(partition_path))
table = build_table(rows)
table.save(partition_path, format: :parquet)
end
end
def build_partition_path(base, keys, values)
partition_dirs = keys.zip(values).map { |k, v| "#{k}=#{v}" }
File.join(base, *partition_dirs, "data.parquet")
end
end
# Usage: partition by year and month
writer = PartitionedWriter.new
writer.write_partitioned_data(
events,
"/data/events",
["year", "month"]
)
# Creates structure:
# /data/events/year=2025/month=01/data.parquet
# /data/events/year=2025/month=02/data.parquet
Avro Event Streaming
Producing and consuming Avro events with schema evolution:
require 'avro'
class EventStream
def initialize(schema_path)
@schema = Avro::Schema.parse(File.read(schema_path))
end
def produce_event(event_data, output_file)
# Avro::DataFile::Writer emits a container-file header; appending with
# "ab" would write a second header into the file, so write a fresh file
file = File.open(output_file, "wb")
writer = Avro::DataFile::Writer.new(
file,
Avro::IO::DatumWriter.new(@schema),
@schema
)
# Validate event against schema
validate_event(event_data)
writer << event_data
writer.close
end
def consume_events(input_file)
file = File.open(input_file, "rb")
reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
reader.each do |event|
yield event
end
reader.close
end
private
def validate_event(event_data)
@schema.fields.each do |field|
value = event_data[field.name]
# A union type such as ["null", "string"] marks the field optional
if value.nil? && field.default.nil? && !field.type.is_a?(Avro::Schema::UnionSchema)
raise ValidationError, "Required field #{field.name} missing"
end
end
end
end
# Event processor with schema evolution
class EventProcessor
def initialize(old_schema_path, new_schema_path)
@old_schema = Avro::Schema.parse(File.read(old_schema_path))
@new_schema = Avro::Schema.parse(File.read(new_schema_path))
end
def process_file(input_file, output_file)
input = File.open(input_file, "rb")
output = File.open(output_file, "wb")
# DatumReader takes (writers_schema, readers_schema): resolve records
# written with the old schema against the new one
reader = Avro::DataFile::Reader.new(
input,
Avro::IO::DatumReader.new(@old_schema, @new_schema)
)
writer = Avro::DataFile::Writer.new(
output,
Avro::IO::DatumWriter.new(@new_schema),
@new_schema
)
reader.each do |event|
transformed = transform_event(event)
writer << transformed
end
reader.close
writer.close
end
private
def transform_event(event)
# Transform from old schema to new schema
# Apply business logic, enrichment, etc.
event.merge(
"processed_at" => Time.now.to_i,
"version" => 2
)
end
end
Performance Considerations
File format performance varies across dimensions: serialization speed, storage size, query efficiency, and memory usage. Understanding these trade-offs guides format selection for specific workloads.
Storage Size and Compression
CSV stores data in plain text with minimal compression opportunities. Each value serialized as human-readable text inflates storage:
# CSV storage characteristics
csv_data = <<~CSV
user_id,timestamp,event_type,page_url
123456,2025-01-15T10:30:00Z,page_view,https://example.com/products/widget-2000
123456,2025-01-15T10:31:00Z,page_view,https://example.com/products/widget-3000
123456,2025-01-15T10:32:00Z,click,https://example.com/products/widget-2000/buy
CSV
# Storage: ~230 bytes for 3 rows
# Integer IDs stored as text: "123456" = 6 bytes vs 4 bytes binary
# Repeated URLs fully written each time
# Column names repeated as headers
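The integer-size claim checks out with Array#pack (the binary packing here is illustrative, not part of CSV itself):

```ruby
id_text   = "123456"            # how CSV stores the value
id_binary = [123456].pack("l<") # 32-bit little-endian encoding

id_text.bytesize   # => 6
id_binary.bytesize # => 4
```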
JSON adds structural overhead with syntax characters and key repetition:
json_data = [
  {
    user_id: 123456,
    timestamp: "2025-01-15T10:30:00Z",
    event_type: "page_view",
    page_url: "https://example.com/products/widget-2000"
  },
  {
    user_id: 123456,
    timestamp: "2025-01-15T10:31:00Z",
    event_type: "page_view",
    page_url: "https://example.com/products/widget-3000"
  }
].to_json
# Storage: ~380 bytes for 2 events
# Key names repeated for each object
# Brackets, braces, quotes add overhead
# Numbers stored as text in JSON
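The key-repetition overhead can be measured directly by serializing the same records both ways (a stdlib-only sketch with made-up records):

```ruby
require 'csv'
require 'json'

records = [
  { "user_id" => 123456, "event_type" => "page_view" },
  { "user_id" => 123456, "event_type" => "click" }
]

csv = CSV.generate do |out|
  out << records.first.keys            # column names written once, as a header
  records.each { |r| out << r.values }
end
json = JSON.generate(records)          # key names repeated in every object

puts csv.bytesize   # 49
puts json.bytesize  # 85
```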
Parquet applies columnar compression and encoding:
# Parquet storage optimizations
# Dictionary encoding for repeated values:
# event_type column: ["page_view", "click"]
# Store: dictionary + indices
# "page_view" appears 1000 times → stored once + 1000 indices
#
# Run-length encoding for sequential values:
# user_id: [123456, 123456, 123456, 123457, 123457]
# Store: [(123456, count=3), (123457, count=2)]
#
# Delta encoding for sorted values:
# timestamps: [1000, 1001, 1002, 1003]
# Store: 1000 + [1, 1, 1]
#
# Result: 10GB CSV → 1-2GB Parquet typical compression ratio
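Dictionary and run-length encoding are simple enough to sketch in plain Ruby; the byte counts below are illustrative, not real Parquet output:

```ruby
values = ["page_view"] * 1000 + ["click"] * 10

# Dictionary encoding: store each distinct string once, plus small indices
dictionary = values.uniq
indices = values.map { |v| dictionary.index(v) }

plain_size = values.sum(&:bytesize)                      # every string written out
dict_size = dictionary.sum(&:bytesize) + indices.length  # assume ~1 byte per index

# Run-length encoding collapses the long index runs further
runs = indices.chunk_while { |a, b| a == b }.map { |run| [run.first, run.length] }

puts plain_size   # 9050
puts dict_size    # 1024
puts runs.inspect # [[0, 1000], [1, 10]]
```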
Avro binary encoding reduces size compared to text formats:
# Avro storage characteristics
# Compact binary encoding:
# Integer 123456 → variable-length encoding (3 bytes)
# String "page_view" → length + bytes
# Schema stored once, not repeated per record
#
# Block compression:
# Multiple records compressed together
# Snappy/Deflate compression on blocks
#
# Result: Smaller than JSON, larger than Parquet
# Good balance of size and random access
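The variable-length integer claim can be checked with a small sketch of Avro's zigzag + varint encoding (a simplified reimplementation, not the avro gem's internals):

```ruby
# Zigzag maps signed integers to unsigned: 0, -1, 1, -2 → 0, 1, 2, 3
def zigzag(n)
  (n << 1) ^ (n >> 63)
end

# Varint: 7 data bits per byte, high bit set on all but the last byte
def varint_bytes(n)
  bytes = []
  loop do
    byte = n & 0x7F
    n >>= 7
    if n.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80) # continuation bit set
  end
  bytes
end

encoded = varint_bytes(zigzag(123456))
puts encoded.length # 3 — versus 6 bytes for the text "123456"
```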
Read Performance
CSV requires full file scanning for queries. Parsing converts text to objects:
require 'csv'
require 'benchmark'
# CSV full scan required for filtering
def find_user_events_csv(file_path, user_id)
  results = []
  CSV.foreach(file_path, headers: true) do |row|
    results << row if row["user_id"] == user_id.to_s
  end
  results
end
# Benchmark: CSV parsing overhead
Benchmark.measure do
  CSV.read("large_file.csv", headers: true, converters: :numeric)
end
# Time dominated by:
# - Text parsing (splitting, unescaping)
# - Type conversion (string → numeric)
# - Full file scan (no indexing)
JSON requires deserializing entire objects:
require 'json'
# JSON reading characteristics
def find_user_events_json(file_path, user_id)
  data = JSON.parse(File.read(file_path))
  data.select { |event| event["user_id"] == user_id }
end
# Performance issues:
# - Must parse entire file into memory
# - Object construction overhead
# - No predicate pushdown
# - No column projection
Parquet enables column pruning and predicate pushdown:
require 'parquet'
# Parquet columnar reading
def find_user_events_parquet(file_path, user_id, needed_columns)
  table = Arrow::Table.load(
    file_path,
    format: :parquet,
    columns: needed_columns # Only read these columns from disk
  )
  # red-arrow filters via the slicer API rather than a row block
  table.slice { |slicer| slicer["user_id"] == user_id }
end
# Performance advantages:
# - Skip unused columns entirely (I/O savings)
# - Row group filtering via statistics
# - Efficient compression maintains good I/O
# - Vectorized processing in Arrow
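Row-group filtering via statistics works because Parquet stores min/max values per row group; a toy model of the skip logic in plain Ruby:

```ruby
# Each Parquet row group carries per-column min/max statistics.
# A reader compares the predicate against them and skips whole
# groups that cannot possibly match — no data pages are read.
RowGroup = Struct.new(:min_user_id, :max_user_id, :rows)

row_groups = [
  RowGroup.new(1,       99_999,  10_000),
  RowGroup.new(100_000, 199_999, 10_000),
  RowGroup.new(200_000, 299_999, 10_000)
]

target = 123_456
scanned = row_groups.select { |g| target.between?(g.min_user_id, g.max_user_id) }

puts scanned.length # 1 — two of three row groups skipped without any I/O
```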
Avro supports sequential streaming with schema projection:
require 'avro'
# Avro streaming read
def process_events_streaming(file_path)
  file = File.open(file_path, "rb")
  reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
  reader.each do |event|
    # Process one record at a time
    # Memory efficient for large files
    yield event
  end
  reader.close
end
# Performance characteristics:
# - Low memory footprint (streaming)
# - Fast binary deserialization
# - Schema reading enables projection
# - Block-level compression
Write Performance
CSV writing appends text with escaping:
require 'csv'
require 'benchmark'
# CSV write performance
require 'time' # Time#iso8601 lives in the time stdlib
Benchmark.measure do
  CSV.open("output.csv", "w") do |csv|
    10_000.times do |i|
      csv << [i, "user_#{i}", Time.now.iso8601, rand(100)]
    end
  end
end
# Bottlenecks:
# - String formatting for each value
# - Quote escaping checks
# - Newline handling
# - No batch optimization
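The quote-escaping cost is visible in `CSV.generate_line`, which inspects every value for delimiters and quote characters:

```ruby
require 'csv'

# Fields containing the delimiter are quoted; embedded quotes are doubled
line = CSV.generate_line(["plain", "has,comma", 'has "quotes"'])
puts line # plain,"has,comma","has ""quotes""" (plus trailing newline)
```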
JSON serialization constructs syntax:
require 'json'
# JSON write performance
require 'time' # Time#iso8601 lives in the time stdlib
Benchmark.measure do
  data = 10_000.times.map do |i|
    {
      id: i,
      name: "user_#{i}",
      timestamp: Time.now.iso8601,
      score: rand(100)
    }
  end
  File.write("output.json", JSON.generate(data))
end
# Overhead:
# - Object traversal
# - Syntax character insertion
# - String escaping
# - Full serialization before write
Parquet batched writes with compression:
require 'parquet'
# Parquet write performance
Benchmark.measure do
  rows = 10_000.times.map do |i|
    [i, "user_#{i}", Time.now, rand(100)]
  end
  table = Arrow::Table.new(
    "id" => rows.map(&:first),
    "name" => rows.map { |r| r[1] },
    "timestamp" => rows.map { |r| r[2] },
    "score" => rows.map { |r| r[3] }
  )
  # save is an instance method in red-arrow
  table.save("output.parquet", format: :parquet, compression: :snappy)
end
# Optimizations:
# - Columnar layout enables vectorization
# - Batch encoding (dictionary, RLE)
# - Parallel column compression
# - Efficient binary encoding
Avro buffered writing:
require 'avro'
# Avro write with periodic block flushes
schema = Avro::Schema.parse(File.read("user.avsc"))
file = File.open("output.avro", "wb")
writer = Avro::DataFile::Writer.new(file, Avro::IO::DatumWriter.new(schema), schema)
10_000.times do |i|
  writer << { "id" => i, "name" => "user_#{i}", "score" => rand(100) }
  # Flush the buffered block every 1000 records (the writer also
  # flushes automatically when its sync interval is exceeded)
  writer.flush if (i % 1000).zero?
end
writer.close
# Performance features:
# - Buffered writes reduce I/O
# - Block compression
# - Sync markers delimit blocks for splittable reads
# - Fast binary encoding
Memory Usage
CSV streaming enables low memory processing:
require 'csv'
# Memory-efficient CSV processing
def process_large_csv(file_path)
  CSV.foreach(file_path, headers: true) do |row|
    # Process one row at a time
    # Memory usage stays constant regardless of file size
    process_row(row)
  end
end
# Memory: O(1) per row, independent of file size
JSON requires loading full document:
require 'json'
# JSON memory usage
def process_json(file_path)
  data = JSON.parse(File.read(file_path)) # Load entire file
  data.each { |item| process_item(item) }
end
# Memory: O(n) where n is data size
# Risk: OutOfMemory for large files
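When the producer can be changed, one common workaround is newline-delimited JSON ("JSON Lines"), which streams like CSV; a minimal sketch:

```ruby
require 'json'

# One JSON document per line; each line parses independently
def process_json_lines(file_path)
  File.foreach(file_path) do |line|
    item = JSON.parse(line)
    yield item # only one object in memory at a time
  end
end
# Memory: O(1) per record, like CSV.foreach
```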
Parquet with Arrow uses memory mapping:
require 'parquet'
# Parquet memory-mapped reading
def process_parquet_columns(file_path, columns)
  table = Arrow::Table.load(file_path, format: :parquet, columns: columns)
  # Arrow memory-maps the file when possible,
  # so data is read from disk on demand
  table.each_record { |record| process_row(record) }
end
# Memory: Depends on column selection and row group size
# More efficient than loading all data
Avro supports streaming iteration:
require 'avro'
# Avro streaming reduces memory
def process_avro_stream(file_path)
  file = File.open(file_path, "rb")
  reader = Avro::DataFile::Reader.new(file, Avro::IO::DatumReader.new)
  reader.each do |record|
    process_record(record)
  end
  reader.close
end
# Memory: O(1) per record + compression buffer
Tools & Ecosystem
Each file format has supporting libraries, command-line tools, and integration points within the Ruby ecosystem and broader data infrastructure.
CSV Tools
Ruby standard library provides comprehensive CSV support:
require 'csv'
# CSV module features
CSV::Converters # Built-in type converters
CSV::HeaderConverters # Header transformation
# Custom converter registration
CSV::Converters[:money] = ->(value) {
  value.gsub(/[$,]/, '').to_f
}
# The field must be quoted, since the value itself contains a comma
CSV.parse('"$1,234.56"', converters: [:money]) # => [[1234.56]]
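CSV::HeaderConverters works the same way for header rows; the built-in :symbol converter normalizes headers into snake_case symbols:

```ruby
require 'csv'

# "User Name" is downcased and underscored into :user_name
rows = CSV.parse("User Name,Age\nAlice,30", headers: true, header_converters: :symbol)
puts rows.headers.inspect   # [:user_name, :age]
puts rows.first[:user_name] # Alice
```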
External gems extend CSV functionality:
# smarter_csv - enhanced parsing with chunking
require 'smarter_csv'
options = {
  chunk_size: 1000,
  key_mapping: { 'User Name' => :user_name },
  remove_empty_values: true
}
SmarterCSV.process('large.csv', options) do |chunk|
  chunk.each { |row| process_row(row) }
end
# roo - reading Excel, OpenOffice, CSV
require 'roo'
xlsx = Roo::Spreadsheet.open('data.xlsx')
csv = Roo::CSV.new('data.csv')
(1..csv.last_row).each { |i| puts csv.row(i).inspect }
Command-line tools for CSV manipulation:
# csvkit - command-line CSV toolkit
csvstat data.csv # Column statistics
csvcut -c name,age data.csv # Select columns
csvgrep -c age -m 30 data.csv # Filter rows
csvsql --query "SELECT * FROM data WHERE age > 30" data.csv
# xsv - fast CSV toolkit in Rust
xsv select name,age data.csv # Column selection
xsv stats data.csv # Statistics
xsv join id data1.csv id data2.csv # Join files
JSON Tools
Ruby JSON library and extensions:
require 'json'
# Fast JSON parsing with Oj gem
require 'oj'
# Oj (Optimized JSON) - faster than standard library
data = Oj.load(json_string)
json = Oj.dump(data)
# Oj modes
Oj.load(json_string, mode: :strict) # Strict JSON
Oj.load(json_string, mode: :compat) # Compatible with JSON gem
Oj.load(json_string, mode: :rails) # Rails-specific
# MultiJson - abstraction over JSON libraries
require 'multi_json'
MultiJson.load(json_string)
MultiJson.dump(data)
MultiJson.use(:oj) # Use Oj backend
JSON Schema validation:
require 'json-schema'
schema = {
  "type" => "object",
  "required" => ["name", "age"],
  "properties" => {
    "name" => { "type" => "string" },
    "age" => { "type" => "integer", "minimum" => 0 }
  }
}
data = { "name" => "Alice", "age" => 30 }
JSON::Validator.validate!(schema, data) # Raises if invalid
Command-line JSON tools:
# jq - JSON processor
cat data.json | jq '.users[] | select(.age > 30)'
cat data.json | jq '[.users[].name]'
cat data.json | jq '.users | group_by(.city)'
# fx - terminal JSON viewer
fx data.json
# json2csv - convert JSON to CSV
json2csv -i data.json -o data.csv -k name,age,city
Parquet Tools
Ruby Parquet support through red-parquet gem:
require 'parquet'
# red-parquet builds on Apache Arrow
table = Arrow::Table.load("data.parquet", format: :parquet)
# Access Arrow functionality
table.n_rows
table.n_columns
table.schema
table.columns
# Table operations (red-arrow uses the slicer API for filtering)
filtered = table.slice { |slicer| slicer["age"] > 30 }
sorted = table.sort_by("age") # sort helpers vary by red-arrow version
Parquet metadata inspection:
require 'parquet'
# Read Parquet schema by loading the table
table = Arrow::Table.load("data.parquet", format: :parquet)
# Schema information
table.schema.fields.each do |field|
  puts "#{field.name}: #{field.data_type}"
end
# Row-group counts and per-column statistics live in the Parquet file
# metadata; red-parquet exposes them through its lower-level reader
# classes, whose names vary across versions
Command-line Parquet tools:
# parquet-tools - Java-based toolkit
parquet-tools schema data.parquet # Show schema
parquet-tools meta data.parquet # Show metadata
parquet-tools cat data.parquet # Show data
parquet-tools head -n 10 data.parquet
# parquet-cli - command-line interface
parquet schema data.parquet
parquet cat --columns name,age data.parquet
parquet to-avro data.parquet output.avro
# DuckDB - query Parquet files with SQL
duckdb -c "SELECT * FROM 'data.parquet' WHERE age > 30"
Avro Tools
Ruby Avro gem:
require 'avro'
# Schema parsing and validation
schema_json = File.read("schema.avsc")
schema = Avro::Schema.parse(schema_json)
# Validate data against schema
valid = Avro::Schema.validate(schema, data)
# Schema fingerprinting (used for single-object encoding)
fingerprint = schema.crc_64_avro_fingerprint
Avro schema registry integration:
require 'avro'
require 'net/http'
class SchemaRegistry
  def initialize(registry_url)
    @registry_url = registry_url
  end

  def register_schema(subject, schema)
    uri = URI("#{@registry_url}/subjects/#{subject}/versions")
    request = Net::HTTP::Post.new(uri)
    request["Content-Type"] = "application/vnd.schemaregistry.v1+json"
    request.body = { schema: schema.to_s }.to_json
    # Enable TLS only for https registry URLs
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == "https") do |http|
      http.request(request)
    end
    JSON.parse(response.body)["id"]
  end

  def get_schema(subject, version = "latest")
    uri = URI("#{@registry_url}/subjects/#{subject}/versions/#{version}")
    response = Net::HTTP.get_response(uri)
    schema_json = JSON.parse(response.body)["schema"]
    Avro::Schema.parse(schema_json)
  end
end

registry = SchemaRegistry.new("http://localhost:8081")
schema_id = registry.register_schema("user-events", schema)
Command-line Avro tools:
# avro-tools - Java-based toolkit
avro-tools getschema data.avro # Extract schema
avro-tools tojson data.avro # Convert to JSON
avro-tools fromjson --schema-file schema.avsc data.json > data.avro
# avro-cli - Go-based toolkit
avro-cli schema data.avro # Show schema
avro-cli cat data.avro # Show data
avro-cli to-parquet data.avro output.parquet
Integration Libraries
File format conversion utilities:
# Convert CSV to Parquet
require 'csv'
require 'parquet'
def csv_to_parquet(csv_file, parquet_file)
  rows = CSV.read(csv_file, headers: true, converters: :numeric)
  # Build Arrow table from CSV columns
  columns = {}
  rows.headers.each do |header|
    columns[header] = rows[header]
  end
  table = Arrow::Table.new(columns)
  table.save(parquet_file, format: :parquet)
end
# Convert JSON to Avro
require 'json'
require 'avro'
def json_to_avro(json_file, avro_file, schema)
  data = JSON.parse(File.read(json_file))
  file = File.open(avro_file, "wb")
  writer = Avro::DataFile::Writer.new(
    file,
    Avro::IO::DatumWriter.new(schema),
    schema
  )
  data.each { |record| writer << record }
  writer.close
end
Database integration:
# PostgreSQL COPY for CSV
require 'pg'
conn = PG.connect(dbname: 'mydb')
conn.exec("COPY users FROM '/path/to/users.csv' WITH CSV HEADER")
# Export to CSV
conn.copy_data("COPY (SELECT * FROM users) TO STDOUT WITH CSV HEADER") do
  File.open("export.csv", "w") do |f|
    while row = conn.get_copy_data
      f.write(row)
    end
  end
end
# JSON columns in PostgreSQL
conn.exec_params("INSERT INTO events (data) VALUES ($1)", [data.to_json])
result = conn.exec("SELECT data FROM events WHERE data->>'user_id' = '123'")
Reference
Format Comparison Matrix
| Characteristic | CSV | JSON | Parquet | Avro |
|---|---|---|---|---|
| Storage Model | Row-based text | Row-based text | Columnar binary | Row-based binary |
| Human Readable | Yes | Yes | No | No |
| Schema Required | No | No | Yes | Yes |
| Nested Data | No | Yes | Yes | Yes |
| Compression Efficiency | Low | Low | High | Medium |
| Read Performance | Slow | Slow | Fast for column queries | Fast for full rows |
| Write Performance | Fast | Fast | Slower | Fast |
| Query Optimization | None | None | Column pruning, predicate pushdown | Limited |
| Schema Evolution | None | Flexible | Limited | Full support |
| Streaming Support | Yes | Limited | No | Yes |
| Size for 1M Rows | ~100MB | ~150MB | ~10-20MB | ~30-40MB |
CSV Processing Reference
| Operation | Method | Notes |
|---|---|---|
| Read CSV | CSV.read(path, options) | Loads entire file into memory |
| Stream CSV | CSV.foreach(path, options) | Memory-efficient row-by-row |
| Parse String | CSV.parse(string, options) | Parse CSV from string |
| Generate CSV | CSV.generate(options) | Build CSV string |
| Write CSV | CSV.open(path, "w") | Write CSV file |
| Headers | headers: true | Parse with headers |
| Type Conversion | converters: :numeric | Convert types during parse |
| Custom Delimiter | col_sep: "\t" | Use tab or other separator |
| Quote Character | quote_char: "'" | Change quote character |
| Skip Blanks | skip_blanks: true | Ignore empty lines |
JSON Processing Reference
| Operation | Method | Notes |
|---|---|---|
| Parse JSON | JSON.parse(string) | Parse JSON string |
| Generate JSON | JSON.generate(object) | Ruby to JSON |
| Pretty Print | JSON.pretty_generate(object) | Formatted output |
| Symbol Keys | symbolize_names: true | Use symbols instead of strings |
| Custom Serialization | define to_json | Custom object encoding |
| Load from File | JSON.load(file) | Parse file directly |
| Streaming Parser | JSON::Stream::Parser (json-stream gem) | Memory-efficient parsing |
| Schema Validation | JSON::Validator.validate! | Validate against schema |
Parquet Operations Reference
| Operation | Code Pattern | Purpose |
|---|---|---|
| Load Table | Arrow::Table.load(path, format: :parquet) | Read Parquet file |
| Save Table | table.save(path, format: :parquet) | Write Parquet file |
| Column Projection | columns: ["col1", "col2"] | Read specific columns only |
| Compression | compression: :snappy | Set compression algorithm |
| Read Metadata | table.schema.fields | Column names and types |
| Row Count | table.n_rows | Number of rows |
| Column Count | table.n_columns | Number of columns |
| Schema | table.schema | Table schema |
| Filter Rows | table.slice { condition } | Row filtering via slicer |
| Group By | table.group(key).sum | Aggregation operations |
Avro Operations Reference
| Operation | Code Pattern | Purpose |
|---|---|---|
| Parse Schema | Avro::Schema.parse(json) | Load schema from JSON |
| Create Writer | Avro::DataFile::Writer.new | Initialize file writer |
| Write Record | writer << record | Add record to file |
| Create Reader | Avro::DataFile::Reader.new | Initialize file reader |
| Read Records | reader.each { block } | Iterate records |
| Schema Evolution | DatumReader.new(writer_schema, reader_schema) | Handle schema changes |
| Validate Data | Avro::Schema.validate(schema, data) | Check data validity |
| Schema Fingerprint | schema.crc_64_avro_fingerprint | Schema hash |
Type Mapping Reference
| Ruby Type | CSV | JSON | Parquet | Avro |
|---|---|---|---|---|
| Integer | "123" | 123 | INT32/INT64 | int, long |
| Float | "99.99" | 99.99 | FLOAT/DOUBLE | float, double |
| String | text | "text" | STRING | string |
| Boolean | "true" | true | BOOLEAN | boolean |
| Date | "2025-01-15" | "2025-01-15" | DATE | int (days from epoch) |
| Timestamp | "2025-01-15T10:30:00Z" | "2025-01-15T10:30:00Z" | TIMESTAMP | long (milliseconds) |
| Decimal | "123.45" | 123.45 | DECIMAL | bytes (logical type) |
| Array | "val1,val2" | [val1, val2] | LIST | array |
| Object | JSON string | {key: val} | STRUCT | record |
| Null | "" | null | NULL | null (union type) |
Compression Algorithms
| Algorithm | CSV | JSON | Parquet | Avro | Speed | Ratio | Use Case |
|---|---|---|---|---|---|---|---|
| None | N/A | N/A | Default | Default | Fastest | 1x | Testing only |
| Gzip | External | External | Supported | Supported | Slow | 3-5x | Maximum compression |
| Snappy | External | External | Recommended | Supported | Fast | 2-3x | Balanced performance |
| Zstd | External | External | Supported | Supported | Balanced | 3-4x | Modern compression |
| LZ4 | External | External | Supported | N/A | Very Fast | 2x | High throughput |
Performance Characteristics
| Scenario | Recommended Format | Rationale |
|---|---|---|
| Small datasets under 10MB | CSV or JSON | Simplicity outweighs efficiency |
| Analytics queries | Parquet | Column pruning, compression |
| Full record access | Avro or CSV | Row-based reading efficient |
| Human inspection needed | CSV or JSON | Text formats readable |
| Schema changes frequent | JSON or Avro | Flexible or managed evolution |
| Storage cost critical | Parquet | Best compression ratios |
| Streaming required | CSV or Avro | Sequential read support |
| Cross-language exchange | JSON | Universal parser support |
| Type safety required | Parquet or Avro | Schema enforcement |
| Fast writes | CSV or JSON | Minimal encoding overhead |