CSV Reading and Writing

Overview

Ruby provides CSV handling through the CSV class in its standard library. The CSV class offers both high-level convenience methods and low-level streaming interfaces for processing comma-separated value files. Ruby's CSV implementation handles RFC 4180-style quoting, encoding conversion, and configurable delimiters.

The CSV class operates in two primary modes: class-level convenience methods for simple operations and instance-based approaches for complex processing. Class methods like CSV.read and CSV.parse handle entire files or strings in memory, while streaming methods process data incrementally.

require 'csv'

# Reading entire file into array
data = CSV.read('sales.csv', headers: true)
data.each { |row| puts row['amount'] }

# Writing CSV data
CSV.open('output.csv', 'w') do |csv|
  csv << ['name', 'age', 'city']
  csv << ['Alice', 25, 'Boston']
end

Ruby converts CSV data into arrays by default, with each row represented as an array of string values. When headers are enabled, rows become CSV::Row objects that provide hash-like access to fields by column name. The library handles quoting and escape sequences according to the CSV format; type conversion is opt-in through converters.

# Parsing with the first row treated as headers
CSV.foreach('data.csv', headers: true) do |row|
  puts "#{row['name']}: #{row['salary']}"
end

Basic Usage

Reading CSV files in Ruby starts with the CSV.read method for loading entire files or CSV.foreach for processing rows individually. The headers option controls whether the first row contains column names.

require 'csv'

# Read entire file into memory
all_data = CSV.read('employees.csv')
all_data.each do |row|
  puts row.join(' | ')
end

# Process one row at a time (memory efficient)
CSV.foreach('employees.csv', headers: true) do |row|
  puts "Employee: #{row['name']}, Salary: #{row['salary']}"
end

Writing CSV data uses CSV.open with a file mode or CSV.generate for creating strings. The CSV writer automatically handles quoting and escaping when values contain commas, newlines, or quotes.

# Writing to file
CSV.open('report.csv', 'w', write_headers: true, headers: ['ID', 'Name', 'Department']) do |csv|
  csv << [1, 'John Doe', 'Engineering']
  csv << [2, 'Jane Smith', 'Marketing']
end

# Generating CSV string
csv_string = CSV.generate do |csv|
  csv << ['Product', 'Price', 'Stock']
  csv << ['Laptop', '$999.99', '15']
  csv << ['Mouse', '$29.99', '200']
end
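
CSV.generate_line shows the automatic quoting directly: fields containing delimiters or quotes get wrapped in quotes, and embedded quotes are doubled per RFC 4180.

# Fields with commas or quotes are escaped automatically
puts CSV.generate_line(['Smith, John', 'He said "hi"', 42])
# => "Smith, John","He said ""hi""",42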

The CSV library handles different delimiters and quote characters through configuration options. Custom delimiters support tab-separated files, pipe-delimited data, and other formats.

# Tab-separated values
CSV.read('data.tsv', col_sep: "\t")

# Custom delimiter and quote character
CSV.read('data.txt', col_sep: '|', quote_char: "'")

# Parsing CSV strings directly
data = "name,age\nAlice,25\nBob,30"
parsed = CSV.parse(data, headers: true)
parsed.each { |row| puts "#{row['name']} is #{row['age']}" }

Column access with headers creates CSV::Row objects that combine array and hash behavior. Rows support both numeric indexing and string-based field access.

CSV.foreach('users.csv', headers: true) do |row|
  # Hash-like access
  puts row['email']
  
  # Array-like access
  puts row[0]

  # Iteration over header/value pairs
  row.each { |header, value| puts "#{header}: #{value}" }
end

Advanced Usage

CSV processing becomes complex when handling data transformations, custom converters, and specialized parsing requirements. Ruby's CSV library provides converter systems for automatic type conversion and field processing.

Built-in converters handle common data types including numeric conversion, date parsing, and header field normalization. Custom converters extend this system for domain-specific transformations.

# Using built-in converters
CSV.read('sales.csv', 
  headers: true,
  converters: [:numeric, :date],
  header_converters: :symbol
)

# Custom converter for currency values (guard against nil fields)
CSV::Converters[:currency] = ->(field) {
  if field&.match?(/\A\$[\d,]+(\.\d+)?\z/)
    field.gsub(/[$,]/, '').to_f
  else
    field
  end
}

data = CSV.read('prices.csv', headers: true, converters: [:currency])

Streaming large CSV files requires careful memory management and error handling. The CSV.filter method copies rows from input to output in streaming fashion: the block receives each row and can modify it in place, and every row is written out after the block returns (the block's return value is ignored).

require 'time'

# Stream transformation: mutate rows in place; CSV.filter writes
# every row to the output after the block runs
CSV.filter(input_file, output_file, headers: true) do |row|
  row['processed_date'] = Time.now.iso8601 if row['status'] == 'active'
end
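
Because CSV.filter writes every input row, dropping rows takes a CSV.foreach reader paired with a CSV.open writer. A minimal sketch, reusing the input_file and output_file paths from above:

# Copy only active rows (plus the header row) to the output
CSV.open(output_file, 'w') do |out|
  CSV.foreach(input_file, headers: true, return_headers: true) do |row|
    next unless row.header_row? || row['status'] == 'active'
    out << row
  end
end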

# Custom parsing with manual control: CSV#shift returns one row
# at a time (CSV#read would slurp all remaining rows)
CSV.open('complex.csv', headers: true) do |csv|
  while (row = csv.shift)
    # Complex business logic here
    process_row(row) if validate_row(row)
  end
end

Complex CSV structures with nested data, multiple header rows, or irregular formats require manual parsing approaches. The CSV library provides row-level access for handling non-standard formats.

# Handling multiple header sections
def parse_complex_csv(filename)
  sections = {}
  current_section = nil
  
  CSV.foreach(filename) do |row|
    if row[0]&.start_with?('[') && row[0]&.end_with?(']')
      current_section = row[0][1..-2]
      sections[current_section] = []
    elsif current_section && !row.all?(&:nil?)
      sections[current_section] << row
    end
  end
  
  sections
end

# Processing with row validation; with_index(2) yields file line
# numbers (line 1 is the header row)
CSV.foreach('data.csv', headers: true, skip_blanks: true).with_index(2) do |row, line_number|
  next unless row.fields.compact.length >= 3

  begin
    validated_row = validate_and_transform(row)
    process_business_logic(validated_row)
  rescue ValidationError => e
    log_error("Row #{line_number}: #{e.message}")
  end
end

The CSV library supports liberal parsing modes for handling malformed data. Liberal parsing attempts to recover from common formatting errors while maintaining data integrity.

# Liberal parsing for malformed CSV
CSV.read('messy.csv',
  headers: true,
  liberal_parsing: true,
  nil_value: '',    # substitute "" for unquoted empty fields
  empty_value: nil  # substitute nil for quoted empty fields
)

Error Handling & Debugging

CSV parsing generates specific exceptions for different error conditions. CSV::MalformedCSVError occurs when encountering invalid CSV structure, while encoding errors arise from character set mismatches.

begin
  CSV.read('problematic.csv', headers: true)
rescue CSV::MalformedCSVError => e
  puts "CSV format error at line #{e.line_number}: #{e.message}"
  
  # Attempt recovery with liberal parsing
  CSV.read('problematic.csv', headers: true, liberal_parsing: true)
rescue Encoding::UndefinedConversionError => e
  puts "Encoding error: #{e.message}"
  
  # Retry with explicit encoding
  CSV.read('problematic.csv', headers: true, encoding: 'ISO-8859-1:UTF-8')
end

Field validation requires checking data types, ranges, and business rules during parsing. Implementing validation chains ensures data quality while maintaining processing performance.

class ValidationError < StandardError
  attr_reader :line_number

  def initialize(message, line_number)
    super(message)
    @line_number = line_number
  end
end

class CSVValidator
  def self.validate_row(row, line_number)
    errors = []

    # Required field validation
    %w[name email].each do |field|
      errors << "Missing #{field}" if row[field].nil? || row[field].empty?
    end

    # Data type validation
    unless row['age']&.match?(/\A\d+\z/)
      errors << "Invalid age format"
    end

    # Business rule validation
    if row['salary'] && row['salary'].to_f.negative?
      errors << "Negative salary not allowed"
    end

    raise ValidationError.new(errors.join(', '), line_number) unless errors.empty?
  end
end

# Usage with error logging; with_index(2) maps rows to file lines
error_log = []
CSV.foreach('employees.csv', headers: true).with_index(2) do |row, line_num|
  begin
    CSVValidator.validate_row(row, line_num)
    process_employee(row)
  rescue ValidationError => e
    error_log << { line: e.line_number, errors: e.message, data: row.to_h }
  end
end

Encoding detection and conversion handles international character sets and legacy file formats. Ruby's CSV library integrates with the Encoding system for automatic character conversion.

# Detect and convert encoding
def safe_csv_read(filename)
  # Try UTF-8 first; invalid bytes surface as an encoding or
  # malformed-CSV error depending on the csv gem version
  begin
    return CSV.read(filename, headers: true, encoding: 'UTF-8')
  rescue CSV::MalformedCSVError, ArgumentError, Encoding::InvalidByteSequenceError
    # Fall back to common encodings, transcoding to UTF-8.
    # ISO-8859-1 accepts every byte sequence, so it goes last as a
    # catch-all (a wrong guess yields mojibake rather than an error).
    %w[UTF-16 Windows-1252 ISO-8859-1].each do |encoding|
      begin
        return CSV.read(filename, headers: true, encoding: "#{encoding}:UTF-8")
      rescue CSV::MalformedCSVError, ArgumentError, Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
        next
      end
    end
  end

  raise "Unable to determine file encoding"
end

Debugging CSV parsing issues requires examining raw file content, character codes, and parsing state. The CSV library provides introspection methods for troubleshooting.

# Debug CSV parsing issues
def debug_csv_file(filename)
  File.open(filename, 'rb') do |file|
    first_line = file.readline
    puts "Raw bytes: #{first_line.bytes.map { |b| b.chr rescue '?' }.join}"
    puts "Encoding: #{file.external_encoding}"
  end
  
  CSV.open(filename, headers: true) do |csv|
    csv.each_with_index do |row, index|
      puts "Line #{csv.lineno}: #{row.fields.length} fields"
      break if index > 5  # Examine first few rows
    end
  end
end

Performance & Memory

Large CSV file processing requires streaming approaches to avoid memory exhaustion. CSV.foreach processes rows individually without loading entire files, while CSV.filter transforms data in streaming fashion.

# Memory-efficient processing of large files
def process_large_csv(filename)
  processed_count = 0
  error_count = 0
  
  CSV.foreach(filename, headers: true) do |row|
    begin
      yield row
      processed_count += 1
    rescue => e
      error_count += 1
      next
    end
    
    # Progress reporting for long operations
    if processed_count % 10_000 == 0
      puts "Processed #{processed_count} rows, #{error_count} errors"
    end
  end
  
  { processed: processed_count, errors: error_count }
end

Performance optimization focuses on minimizing string allocations, avoiding unnecessary conversions, and using appropriate parsing options. Disabling unused features improves processing speed.

require 'benchmark'

# Compare parsing approaches
Benchmark.bm(20) do |x|
  x.report("CSV.read") do
    CSV.read('large.csv')
  end
  
  x.report("CSV.foreach") do
    CSV.foreach('large.csv') { |row| row }
  end
  
  x.report("minimal parsing") do
    CSV.foreach('large.csv', 
      headers: false,
      converters: [],
      skip_blanks: false
    ) { |row| row }
  end
end

Memory usage patterns differ significantly between parsing approaches. Streaming methods maintain constant memory usage regardless of file size, while batch methods scale linearly with data volume.

# Monitor memory usage during processing
require 'get_process_mem'

def memory_efficient_transform(input_file, output_file)
  start_memory = GetProcessMem.new.mb
  row_count = 0
  
  CSV.open(output_file, 'w', write_headers: true,
           headers: %w[id name processed_date]) do |output|
    
    CSV.foreach(input_file, headers: true) do |row|
      transformed = transform_row(row)
      output << transformed if transformed
      
      row_count += 1
      if row_count % 1000 == 0
        current_memory = GetProcessMem.new.mb
        puts "Processed #{row_count} rows, Memory: #{current_memory - start_memory}MB"
      end
    end
  end
end

Parallel processing of CSV data requires careful coordination to maintain row order and handle shared resources. Under MRI's global VM lock, threads only overlap IO-bound work, so a thread pool suits transformations that wait on IO; CPU-bound processing needs separate processes (or Ractors) to run in parallel.

# Parallel CSV processing with a thread pool (Thread and Queue are
# core classes; require 'thread' is no longer needed)
def parallel_csv_process(filename, thread_count: 4)
  queue = Queue.new
  results = Queue.new
  
  # Producer thread
  producer = Thread.new do
    CSV.foreach(filename, headers: true).with_index do |row, index|
      queue << [row, index]
    end
    thread_count.times { queue << :end }
  end
  
  # Worker threads
  workers = thread_count.times.map do
    Thread.new do
      while (item = queue.pop) != :end
        row, index = item
        processed = expensive_row_processing(row)
        results << [processed, index]
      end
    end
  end
  
  # Collector
  all_results = []
  collector = Thread.new do
    while workers.any?(&:alive?) || !results.empty?
      begin
        result, index = results.pop(true)
        all_results << [result, index]
      rescue ThreadError
        sleep 0.01
      end
    end
  end
  
  [producer, *workers, collector].each(&:join)
  all_results.sort_by(&:last).map(&:first)
end

Reference

Core Methods

Method                            | Parameters                                            | Returns                          | Description
CSV.read(path, **opts)            | path (String), options (Hash)                         | Array<Array> or CSV::Table       | Reads entire CSV file into memory (CSV::Table when headers are enabled)
CSV.foreach(path, **opts)         | path (String), options (Hash)                         | nil, or Enumerator without block | Yields each row to the block
CSV.parse(string, **opts)         | string (String), options (Hash)                       | Array<Array> or CSV::Table       | Parses CSV string in memory
CSV.generate(**opts)              | options (Hash)                                        | String                           | Builds a CSV string inside the block
CSV.open(path, mode, **opts)      | path (String), mode (String), options (Hash)          | CSV                              | Opens CSV file for reading or writing
CSV.filter(input, output, **opts) | input (String/IO), output (String/IO), options (Hash) | nil                              | Streams rows from input to output, yielding each for in-place transformation

Instance Methods

Method    | Parameters           | Returns                    | Description
#shift    | None                 | Array/CSV::Row or nil      | Reads the next row; nil at EOF
#read     | None                 | Array<Array> or CSV::Table | Reads all remaining rows at once
#readline | None                 | Array/CSV::Row or nil      | Alias for #shift
#each     | Block                | Enumerator                 | Iterates over all rows
#<<(row)  | row (Array/CSV::Row) | self                       | Writes row to CSV
#flush    | None                 | CSV                        | Flushes output buffer
#close    | None                 | nil                        | Closes CSV file
#lineno   | None                 | Integer                    | Current line number
#rewind   | None                 | CSV                        | Resets file position and line number
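
A short sketch of the instance-level reading API, assuming a data.csv file with a header row; it contrasts #shift with #read:

require 'csv'

CSV.open('data.csv', headers: true) do |csv|
  first = csv.shift                # read a single row
  puts "Read through line #{csv.lineno}: #{first.inspect}"

  csv.rewind                       # back to the start of the file
  all_rows = csv.read              # slurp everything that remains
  puts "Re-read #{all_rows.size} rows after rewind"
end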

Parse Options

Option              | Type                 | Default | Description
:headers            | Boolean/Array/String | false   | Enable header row processing
:col_sep            | String               | ","     | Column separator string
:row_sep            | String/Symbol        | :auto   | Row separator sequence
:quote_char         | String               | '"'     | Quote character for fields
:field_size_limit   | Integer              | nil     | Maximum field size in characters
:converters         | Array/Symbol         | nil     | Field value converters
:unconverted_fields | Boolean              | false   | Preserve original field values
:header_converters  | Array/Symbol         | nil     | Header name converters
:return_headers     | Boolean              | false   | Yield the header row as a row
:write_headers      | Boolean              | false   | Write headers when opening for output
:skip_blanks        | Boolean              | false   | Skip empty rows
:skip_lines         | Regexp               | nil     | Skip raw lines matching pattern
:liberal_parsing    | Boolean              | false   | Accept some malformed CSV
:nil_value          | Object               | nil     | Value substituted for unquoted empty fields
:empty_value        | Object               | ""      | Value substituted for quoted empty fields
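
Options combine freely. A sketch that skips #-prefixed comment lines and maps quoted-empty fields to nil (annotated.csv is a hypothetical file):

CSV.foreach('annotated.csv',
  headers: true,
  skip_lines: /\A#/,   # ignore raw lines beginning with '#'
  skip_blanks: true,
  empty_value: nil     # quoted-empty fields come back as nil
) do |row|
  p row.to_h
end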

Built-in Converters

Converter  | Description
:integer   | Convert integer strings to Integer objects
:float     | Convert float strings to Float objects
:numeric   | Apply both :integer and :float
:date      | Convert date strings to Date objects
:date_time | Convert datetime strings to DateTime objects
:all       | Apply :date_time and :numeric
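
A quick demonstration of converters with CSV.parse; fields that do not match a converter's pattern pass through unchanged as strings:

rows = CSV.parse("id,price,shipped\n1,19.99,2024-05-01\n",
                 headers: true, converters: [:numeric, :date])
row = rows.first
p row['id']       # => 1 (Integer)
p row['price']    # => 19.99 (Float)
p row['shipped']  # => Date object for 2024-05-01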

Header Converters

Converter | Description
:downcase | Convert headers to lowercase
:symbol   | Convert headers to lowercased, underscored symbols
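
The :symbol converter downcases names and replaces spaces with underscores; a quick check with CSV.parse:

table = CSV.parse("Full Name,Email Address\nAlice,a@example.com\n",
                  headers: true, header_converters: :symbol)
p table.headers            # => [:full_name, :email_address]
p table.first[:full_name]  # => "Alice"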

Exception Hierarchy

Exception                          | Description
CSV::MalformedCSVError             | Invalid CSV format detected
CSV::Row::InvalidIndexError        | Invalid row index access
Encoding::InvalidByteSequenceError | Invalid character encoding
Encoding::UndefinedConversionError | Character conversion failure

CSV::Row Methods

Method               | Parameters                      | Returns    | Description
#[](index)           | index (Integer/String)          | Object     | Access field by index or header
#[]=(index, value)   | index (Integer/String), value   | value      | Set field by index or header
#field(name, offset) | name (String), offset (Integer) | Object     | Get field by name, searching from offset
#fields(*headers)    | headers (Array)                 | Array      | Get multiple fields
#values_at(*indices) | indices (Array)                 | Array      | Get fields at indices
#each                | Block                           | Enumerator | Iterate over header/value pairs
#to_h                | None                            | Hash       | Convert to Hash
#to_a                | None                            | Array      | Convert to Array of header/value pairs
#headers             | None                            | Array      | Get header names
#length              | None                            | Integer    | Number of fields
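
A brief CSV::Row tour using CSV.parse_line with an explicit header array:

row = CSV.parse_line('Alice,alice@example.com,Boston',
                     headers: %w[name email city])
p row['name']          # => "Alice"
p row.values_at(0, 2)  # => ["Alice", "Boston"]
p row.to_h             # => {"name"=>"Alice", "email"=>"alice@example.com", "city"=>"Boston"}
p row.headers          # => ["name", "email", "city"]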