CSV Reading and Writing

Overview

Ruby provides CSV handling through the CSV class in its standard library. The CSV class offers both high-level convenience methods and low-level streaming interfaces for processing comma-separated value files. Ruby's CSV implementation handles RFC 4180-style quoting, encoding conversion, and configurable delimiters.

The CSV class operates in two primary modes: class-level convenience methods for simple operations and instance-based approaches for complex processing. Class methods like CSV.read and CSV.parse handle entire files or strings in memory, while streaming methods process data incrementally.

require 'csv'

# Reading entire file into array
data = CSV.read('sales.csv', headers: true)
data.each { |row| puts row['amount'] }

# Writing CSV data
CSV.open('output.csv', 'w') do |csv|
  csv << ['name', 'age', 'city']
  csv << ['Alice', 25, 'Boston']
end

Ruby converts CSV data into arrays by default, with each row represented as an array of string values. When headers are enabled, rows become CSV::Row objects that provide hash-like access to fields by column name. The library handles quoting and escape sequences according to the CSV format; type conversion is opt-in through converters.

# Parsing with the first row treated as headers
CSV.foreach('data.csv', headers: true) do |row|
  puts "#{row['name']}: #{row['salary']}"
end

Basic Usage

Reading CSV files in Ruby starts with the CSV.read method for loading entire files or CSV.foreach for processing rows individually. The headers option controls whether the first row contains column names.

require 'csv'

# Read entire file into memory
all_data = CSV.read('employees.csv')
all_data.each do |row|
  puts row.join(' | ')
end

# Process one row at a time (memory efficient)
CSV.foreach('employees.csv', headers: true) do |row|
  puts "Employee: #{row['name']}, Salary: #{row['salary']}"
end

Writing CSV data uses CSV.open with a file mode or CSV.generate for creating strings. The CSV writer automatically handles quoting and escaping when values contain commas, newlines, or quotes.

# Writing to file
CSV.open('report.csv', 'w', write_headers: true, headers: ['ID', 'Name', 'Department']) do |csv|
  csv << [1, 'John Doe', 'Engineering']
  csv << [2, 'Jane Smith', 'Marketing']
end

# Generating CSV string
csv_string = CSV.generate do |csv|
  csv << ['Product', 'Price', 'Stock']
  csv << ['Laptop', '$999.99', '15']
  csv << ['Mouse', '$29.99', '200']
end
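
CSV.generate_line shows the automatic quoting directly: fields containing delimiters or quotes get wrapped in quotes, and embedded quotes are doubled per RFC 4180.

# Fields with commas or quotes are escaped automatically
puts CSV.generate_line(['Smith, John', 'He said "hi"', 42])
# => "Smith, John","He said ""hi""",42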

The CSV library handles different delimiters and quote characters through configuration options. Custom delimiters support tab-separated files, pipe-delimited data, and other formats.

# Tab-separated values
CSV.read('data.tsv', col_sep: "\t")

# Custom delimiter and quote character
CSV.read('data.txt', col_sep: '|', quote_char: "'")

# Parsing CSV strings directly
data = "name,age\nAlice,25\nBob,30"
parsed = CSV.parse(data, headers: true)
parsed.each { |row| puts "#{row['name']} is #{row['age']}" }

Column access with headers creates CSV::Row objects that combine array and hash behavior. Rows support both numeric indexing and string-based field access.

CSV.foreach('users.csv', headers: true) do |row|
  # Hash-like access
  puts row['email']
  
  # Array-like access
  puts row[0]

  # Iteration over header/value pairs
  row.each { |header, value| puts "#{header}: #{value}" }
end

Advanced Usage

CSV processing becomes complex when handling data transformations, custom converters, and specialized parsing requirements. Ruby's CSV library provides converter systems for automatic type conversion and field processing.

Built-in converters handle common data types including numeric conversion, date parsing, and header field normalization. Custom converters extend this system for domain-specific transformations.

# Using built-in converters
CSV.read('sales.csv', 
  headers: true,
  converters: [:numeric, :date],
  header_converters: :symbol
)

# Custom converter for currency values (guard against nil fields)
CSV::Converters[:currency] = ->(field) {
  if field&.match?(/\A\$[\d,]+(\.\d+)?\z/)
    field.gsub(/[$,]/, '').to_f
  else
    field
  end
}

data = CSV.read('prices.csv', headers: true, converters: [:currency])

Streaming large CSV files requires careful memory management and error handling. The CSV.filter method copies rows from input to output in streaming fashion: the block receives each row and can modify it in place, and every row is written out after the block returns (the block's return value is ignored).

require 'time'

# Stream transformation: mutate rows in place; CSV.filter writes
# every row to the output after the block runs
CSV.filter(input_file, output_file, headers: true) do |row|
  row['processed_date'] = Time.now.iso8601 if row['status'] == 'active'
end
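
Because CSV.filter writes every input row, dropping rows takes a CSV.foreach reader paired with a CSV.open writer. A minimal sketch, reusing the input_file and output_file paths from above:

# Copy only active rows (plus the header row) to the output
CSV.open(output_file, 'w') do |out|
  CSV.foreach(input_file, headers: true, return_headers: true) do |row|
    next unless row.header_row? || row['status'] == 'active'
    out << row
  end
end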

# Custom parsing with manual control: CSV#shift returns one row
# at a time (CSV#read would slurp all remaining rows)
CSV.open('complex.csv', headers: true) do |csv|
  while (row = csv.shift)
    # Complex business logic here
    process_row(row) if validate_row(row)
  end
end

Complex CSV structures with nested data, multiple header rows, or irregular formats require manual parsing approaches. The CSV library provides row-level access for handling non-standard formats.

# Handling multiple header sections
def parse_complex_csv(filename)
  sections = {}
  current_section = nil
  
  CSV.foreach(filename) do |row|
    if row[0]&.start_with?('[') && row[0]&.end_with?(']')
      current_section = row[0][1..-2]
      sections[current_section] = []
    elsif current_section && !row.all?(&:nil?)
      sections[current_section] << row
    end
  end
  
  sections
end

# Processing with row validation; with_index(2) yields file line
# numbers (line 1 is the header row)
CSV.foreach('data.csv', headers: true, skip_blanks: true).with_index(2) do |row, line_number|
  next unless row.fields.compact.length >= 3

  begin
    validated_row = validate_and_transform(row)
    process_business_logic(validated_row)
  rescue ValidationError => e
    log_error("Row #{line_number}: #{e.message}")
  end
end

The CSV library supports liberal parsing modes for handling malformed data. Liberal parsing attempts to recover from common formatting errors while maintaining data integrity.

# Liberal parsing for malformed CSV
CSV.read('messy.csv',
  headers: true,
  liberal_parsing: true,
  nil_value: '',    # substitute "" for unquoted empty fields
  empty_value: nil  # substitute nil for quoted empty fields
)

Error Handling & Debugging

CSV parsing generates specific exceptions for different error conditions. CSV::MalformedCSVError occurs when encountering invalid CSV structure, while encoding errors arise from character set mismatches.

begin
  CSV.read('problematic.csv', headers: true)
rescue CSV::MalformedCSVError => e
  puts "CSV format error at line #{e.line_number}: #{e.message}"
  
  # Attempt recovery with liberal parsing
  CSV.read('problematic.csv', headers: true, liberal_parsing: true)
rescue Encoding::UndefinedConversionError => e
  puts "Encoding error: #{e.message}"
  
  # Retry with explicit encoding
  CSV.read('problematic.csv', headers: true, encoding: 'ISO-8859-1:UTF-8')
end

Field validation requires checking data types, ranges, and business rules during parsing. Implementing validation chains ensures data quality while maintaining processing performance.

class ValidationError < StandardError
  attr_reader :line_number

  def initialize(message, line_number)
    super(message)
    @line_number = line_number
  end
end

class CSVValidator
  def self.validate_row(row, line_number)
    errors = []

    # Required field validation
    %w[name email].each do |field|
      errors << "Missing #{field}" if row[field].nil? || row[field].empty?
    end

    # Data type validation
    unless row['age']&.match?(/\A\d+\z/)
      errors << "Invalid age format"
    end

    # Business rule validation
    if row['salary'] && row['salary'].to_f.negative?
      errors << "Negative salary not allowed"
    end

    raise ValidationError.new(errors.join(', '), line_number) unless errors.empty?
  end
end

# Usage with error logging; with_index(2) maps rows to file lines
error_log = []
CSV.foreach('employees.csv', headers: true).with_index(2) do |row, line_num|
  begin
    CSVValidator.validate_row(row, line_num)
    process_employee(row)
  rescue ValidationError => e
    error_log << { line: e.line_number, errors: e.message, data: row.to_h }
  end
end

Encoding detection and conversion handles international character sets and legacy file formats. Ruby's CSV library integrates with the Encoding system for automatic character conversion.

# Detect and convert encoding
def safe_csv_read(filename)
  # Try UTF-8 first; invalid bytes surface as an encoding or
  # malformed-CSV error depending on the csv gem version
  begin
    return CSV.read(filename, headers: true, encoding: 'UTF-8')
  rescue CSV::MalformedCSVError, ArgumentError, Encoding::InvalidByteSequenceError
    # Fall back to common encodings, transcoding to UTF-8.
    # ISO-8859-1 accepts every byte sequence, so it goes last as a
    # catch-all (a wrong guess yields mojibake rather than an error).
    %w[UTF-16 Windows-1252 ISO-8859-1].each do |encoding|
      begin
        return CSV.read(filename, headers: true, encoding: "#{encoding}:UTF-8")
      rescue CSV::MalformedCSVError, ArgumentError, Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
        next
      end
    end
  end

  raise "Unable to determine file encoding"
end

Debugging CSV parsing issues requires examining raw file content, character codes, and parsing state. The CSV library provides introspection methods for troubleshooting.

# Debug CSV parsing issues
def debug_csv_file(filename)
  File.open(filename, 'rb') do |file|
    first_line = file.readline
    puts "Raw bytes: #{first_line.bytes.map { |b| b.chr rescue '?' }.join}"
    puts "Encoding: #{file.external_encoding}"
  end
  
  CSV.open(filename, headers: true) do |csv|
    csv.each_with_index do |row, index|
      puts "Line #{csv.lineno}: #{row.fields.length} fields"
      break if index > 5  # Examine first few rows
    end
  end
end

Performance & Memory

Large CSV file processing requires streaming approaches to avoid memory exhaustion. CSV.foreach processes rows individually without loading entire files, while CSV.filter transforms data in streaming fashion.

# Memory-efficient processing of large files
def process_large_csv(filename)
  processed_count = 0
  error_count = 0
  
  CSV.foreach(filename, headers: true) do |row|
    begin
      yield row
      processed_count += 1
    rescue => e
      error_count += 1
      next
    end
    
    # Progress reporting for long operations
    if processed_count % 10_000 == 0
      puts "Processed #{processed_count} rows, #{error_count} errors"
    end
  end
  
  { processed: processed_count, errors: error_count }
end

Performance optimization focuses on minimizing string allocations, avoiding unnecessary conversions, and using appropriate parsing options. Disabling unused features improves processing speed.

require 'benchmark'

# Compare parsing approaches
Benchmark.bm(20) do |x|
  x.report("CSV.read") do
    CSV.read('large.csv')
  end
  
  x.report("CSV.foreach") do
    CSV.foreach('large.csv') { |row| row }
  end
  
  x.report("minimal parsing") do
    CSV.foreach('large.csv', 
      headers: false,
      converters: [],
      skip_blanks: false
    ) { |row| row }
  end
end

Memory usage patterns differ significantly between parsing approaches. Streaming methods maintain constant memory usage regardless of file size, while batch methods scale linearly with data volume.

# Monitor memory usage during processing
require 'get_process_mem'

def memory_efficient_transform(input_file, output_file)
  start_memory = GetProcessMem.new.mb
  row_count = 0
  
  CSV.open(output_file, 'w', write_headers: true,
           headers: %w[id name processed_date]) do |output|
    
    CSV.foreach(input_file, headers: true) do |row|
      transformed = transform_row(row)
      output << transformed if transformed
      
      row_count += 1
      if row_count % 1000 == 0
        current_memory = GetProcessMem.new.mb
        puts "Processed #{row_count} rows, Memory: #{current_memory - start_memory}MB"
      end
    end
  end
end

Parallel processing of CSV data requires careful coordination to maintain row order and handle shared resources. Under MRI's global VM lock, threads only overlap IO-bound work, so a thread pool suits transformations that wait on IO; CPU-bound processing needs separate processes (or Ractors) to run in parallel.

# Parallel CSV processing with a thread pool (Thread and Queue are
# core classes; require 'thread' is no longer needed)
def parallel_csv_process(filename, thread_count: 4)
  queue = Queue.new
  results = Queue.new
  
  # Producer thread
  producer = Thread.new do
    CSV.foreach(filename, headers: true).with_index do |row, index|
      queue << [row, index]
    end
    thread_count.times { queue << :end }
  end
  
  # Worker threads
  workers = thread_count.times.map do
    Thread.new do
      while (item = queue.pop) != :end
        row, index = item
        processed = expensive_row_processing(row)
        results << [processed, index]
      end
    end
  end
  
  # Collector
  all_results = []
  collector = Thread.new do
    while workers.any?(&:alive?) || !results.empty?
      begin
        result, index = results.pop(true)
        all_results << [result, index]
      rescue ThreadError
        sleep 0.01
      end
    end
  end
  
  [producer, *workers, collector].each(&:join)
  all_results.sort_by(&:last).map(&:first)
end

Reference

Core Methods

Method                            | Parameters                                            | Returns                          | Description
CSV.read(path, **opts)            | path (String), options (Hash)                         | Array<Array> or CSV::Table       | Reads entire CSV file into memory (CSV::Table when headers are enabled)
CSV.foreach(path, **opts)         | path (String), options (Hash)                         | nil, or Enumerator without block | Yields each row to the block
CSV.parse(string, **opts)         | string (String), options (Hash)                       | Array<Array> or CSV::Table       | Parses CSV string in memory
CSV.generate(**opts)              | options (Hash)                                        | String                           | Builds a CSV string inside the block
CSV.open(path, mode, **opts)      | path (String), mode (String), options (Hash)          | CSV                              | Opens CSV file for reading or writing
CSV.filter(input, output, **opts) | input (String/IO), output (String/IO), options (Hash) | nil                              | Streams rows from input to output, yielding each for in-place transformation

Instance Methods

Method    | Parameters           | Returns                    | Description
#shift    | None                 | Array/CSV::Row or nil      | Reads the next row; nil at EOF
#read     | None                 | Array<Array> or CSV::Table | Reads all remaining rows at once
#readline | None                 | Array/CSV::Row or nil      | Alias for #shift
#each     | Block                | Enumerator                 | Iterates over all rows
#<<(row)  | row (Array/CSV::Row) | self                       | Writes row to CSV
#flush    | None                 | CSV                        | Flushes output buffer
#close    | None                 | nil                        | Closes CSV file
#lineno   | None                 | Integer                    | Current line number
#rewind   | None                 | CSV                        | Resets file position and line number
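
A short sketch of the instance-level reading API, assuming a data.csv file with a header row; it contrasts #shift with #read:

require 'csv'

CSV.open('data.csv', headers: true) do |csv|
  first = csv.shift                # read a single row
  puts "Read through line #{csv.lineno}: #{first.inspect}"

  csv.rewind                       # back to the start of the file
  all_rows = csv.read              # slurp everything that remains
  puts "Re-read #{all_rows.size} rows after rewind"
end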

Parse Options

Option              | Type                 | Default | Description
:headers            | Boolean/Array/String | false   | Enable header row processing
:col_sep            | String               | ","     | Column separator string
:row_sep            | String/Symbol        | :auto   | Row separator sequence
:quote_char         | String               | '"'     | Quote character for fields
:field_size_limit   | Integer              | nil     | Maximum field size in characters
:converters         | Array/Symbol         | nil     | Field value converters
:unconverted_fields | Boolean              | false   | Preserve original field values
:header_converters  | Array/Symbol         | nil     | Header name converters
:return_headers     | Boolean              | false   | Yield the header row as a row
:write_headers      | Boolean              | false   | Write headers when opening for output
:skip_blanks        | Boolean              | false   | Skip empty rows
:skip_lines         | Regexp               | nil     | Skip raw lines matching pattern
:liberal_parsing    | Boolean              | false   | Accept some malformed CSV
:nil_value          | Object               | nil     | Value substituted for unquoted empty fields
:empty_value        | Object               | ""      | Value substituted for quoted empty fields
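
Options combine freely. A sketch that skips #-prefixed comment lines and maps quoted-empty fields to nil (annotated.csv is a hypothetical file):

CSV.foreach('annotated.csv',
  headers: true,
  skip_lines: /\A#/,   # ignore raw lines beginning with '#'
  skip_blanks: true,
  empty_value: nil     # quoted-empty fields come back as nil
) do |row|
  p row.to_h
end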

Built-in Converters

Converter  | Description
:integer   | Convert integer strings to Integer objects
:float     | Convert float strings to Float objects
:numeric   | Apply both :integer and :float
:date      | Convert date strings to Date objects
:date_time | Convert datetime strings to DateTime objects
:all       | Apply :date_time and :numeric
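
A quick demonstration of converters with CSV.parse; fields that do not match a converter's pattern pass through unchanged as strings:

rows = CSV.parse("id,price,shipped\n1,19.99,2024-05-01\n",
                 headers: true, converters: [:numeric, :date])
row = rows.first
p row['id']       # => 1 (Integer)
p row['price']    # => 19.99 (Float)
p row['shipped']  # => Date object for 2024-05-01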

Header Converters

Converter | Description
:downcase | Convert headers to lowercase
:symbol   | Convert headers to lowercased, underscored symbols
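
The :symbol converter downcases names and replaces spaces with underscores; a quick check with CSV.parse:

table = CSV.parse("Full Name,Email Address\nAlice,a@example.com\n",
                  headers: true, header_converters: :symbol)
p table.headers            # => [:full_name, :email_address]
p table.first[:full_name]  # => "Alice"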

Exception Hierarchy

Exception                          | Description
CSV::MalformedCSVError             | Invalid CSV format detected
CSV::Row::InvalidIndexError        | Invalid row index access
Encoding::InvalidByteSequenceError | Invalid character encoding
Encoding::UndefinedConversionError | Character conversion failure

CSV::Row Methods

Method               | Parameters                      | Returns    | Description
#[](index)           | index (Integer/String)          | Object     | Access field by index or header
#[]=(index, value)   | index (Integer/String), value   | value      | Set field by index or header
#field(name, offset) | name (String), offset (Integer) | Object     | Get field by name, searching from offset
#fields(*headers)    | headers (Array)                 | Array      | Get multiple fields
#values_at(*indices) | indices (Array)                 | Array      | Get fields at indices
#each                | Block                           | Enumerator | Iterate over header/value pairs
#to_h                | None                            | Hash       | Convert to Hash
#to_a                | None                            | Array      | Convert to Array of header/value pairs
#headers             | None                            | Array      | Get header names
#length              | None                            | Integer    | Number of fields
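
A brief CSV::Row tour using CSV.parse_line with an explicit header array:

row = CSV.parse_line('Alice,alice@example.com,Boston',
                     headers: %w[name email city])
p row['name']          # => "Alice"
p row.values_at(0, 2)  # => ["Alice", "Boston"]
p row.to_h             # => {"name"=>"Alice", "email"=>"alice@example.com", "city"=>"Boston"}
p row.headers          # => ["name", "email", "city"]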