Overview
Ruby provides CSV handling through the CSV class in its standard library. The CSV class offers both high-level convenience methods and low-level streaming interfaces for processing comma-separated value files. Ruby's CSV implementation handles RFC 4180 compliance, encoding conversion, and configurable delimiters and quoting.
The CSV class operates in two primary modes: class-level convenience methods for simple operations and instance-based approaches for complex processing. Class methods like CSV.read and CSV.parse handle entire files or strings in memory, while streaming methods process data incrementally.
require 'csv'

# Reading entire file into memory
data = CSV.read('sales.csv', headers: true)
data.each { |row| puts row['amount'] }

# Writing CSV data
CSV.open('output.csv', 'w') do |csv|
  csv << ['name', 'age', 'city']
  csv << ['Alice', 25, 'Boston']
end
Ruby converts CSV data into arrays by default, with each row represented as an array of string values. When headers are enabled, rows become CSV::Row objects that provide hash-like access to fields by column name. The library applies quoting and escaping according to RFC 4180, and performs type conversion when converters are configured.
# Parsing with the first row treated as headers
CSV.foreach('data.csv', headers: true) do |row|
  puts "#{row['name']}: #{row['salary']}"
end
Basic Usage
Reading CSV files in Ruby starts with the CSV.read method for loading entire files or CSV.foreach for processing rows individually. The headers option controls whether the first row is treated as column names.
require 'csv'

# Read entire file into memory
all_data = CSV.read('employees.csv')
all_data.each do |row|
  puts row.join(' | ')
end

# Process one row at a time (memory efficient)
CSV.foreach('employees.csv', headers: true) do |row|
  puts "Employee: #{row['name']}, Salary: #{row['salary']}"
end
Writing CSV data uses CSV.open with a file mode or CSV.generate for creating strings. The writer automatically quotes and escapes values that contain commas, newlines, or quote characters.
# Writing to file
CSV.open('report.csv', 'w', write_headers: true, headers: ['ID', 'Name', 'Department']) do |csv|
  csv << [1, 'John Doe', 'Engineering']
  csv << [2, 'Jane Smith', 'Marketing']
end

# Generating CSV string
csv_string = CSV.generate do |csv|
  csv << ['Product', 'Price', 'Stock']
  csv << ['Laptop', '$999.99', '15']
  csv << ['Mouse', '$29.99', '200']
end
The CSV library handles different delimiters and quote characters through configuration options. Custom delimiters support tab-separated files, pipe-delimited data, and other formats.
# Tab-separated values
CSV.read('data.tsv', col_sep: "\t")
# Custom delimiter and quote character
CSV.read('data.txt', col_sep: '|', quote_char: '"')
# Parsing CSV strings directly
data = "name,age\nAlice,25\nBob,30"
parsed = CSV.parse(data, headers: true)
parsed.each { |row| puts "#{row['name']} is #{row['age']}" }
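These options behave identically with CSV.parse, which makes them easy to verify against inline strings; a minimal check with pipe-delimited data:

```ruby
require 'csv'

# col_sep applies to string parsing just as it does to files
rows = CSV.parse("id|name\n1|Alice\n2|Bob", col_sep: '|', headers: true)
rows.each { |row| puts "#{row['id']}: #{row['name']}" }
```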
Column access with headers creates CSV::Row objects that combine array and hash behavior. Rows support both numeric indexing and string-based field access.
CSV.foreach('users.csv', headers: true) do |row|
  # Hash-like access by header
  puts row['email']

  # Array-like access by position
  puts row[0]

  # Iteration yields [header, value] pairs
  row.each { |header, value| puts value }
end
Advanced Usage
CSV processing becomes complex when handling data transformations, custom converters, and specialized parsing requirements. Ruby's CSV library provides converter systems for automatic type conversion and field processing.
Built-in converters handle common data types including numeric conversion, date parsing, and header field normalization. Custom converters extend this system for domain-specific transformations.
# Using built-in converters
CSV.read('sales.csv',
         headers: true,
         converters: [:numeric, :date],
         header_converters: :symbol)

# Custom converter for currency values (anchored so the whole
# field must look like a dollar amount)
CSV::Converters[:currency] = lambda do |field|
  if field.match?(/\A\$[\d,]+\.?\d*\z/)
    field.gsub(/[$,]/, '').to_f
  else
    field
  end
end

data = CSV.read('prices.csv', headers: true, converters: [:currency])
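Converter behavior is easiest to confirm with CSV.parse on an inline string. This standalone sketch re-registers the converter under the same hypothetical :currency key so it runs on its own:

```ruby
require 'csv'

# Register a custom converter: strip $ and commas from dollar amounts
CSV::Converters[:currency] = lambda do |field|
  field.match?(/\A\$[\d,]+\.?\d*\z/) ? field.gsub(/[$,]/, '').to_f : field
end

csv_data = "item,price\nLaptop,\"$1,299.99\"\nSticker,$0.99"
rows = CSV.parse(csv_data, headers: true, converters: [:currency])
rows.each { |row| puts "#{row['item']}: #{row['price']}" }
```

Fields that do not match the pattern (like the item names) pass through unchanged as strings.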
Streaming large CSV files requires careful memory management and error handling. The CSV.filter method streams rows from input to output without loading the entire dataset; note that it writes every row it reads (the block's return value is ignored), so dropping rows requires pairing CSV.foreach with an explicit writer.
# Stream-transform a large file; CSV.filter cannot skip rows,
# so filtering pairs CSV.foreach with a separate writer
require 'time'

CSV.open('output.csv', 'w') do |out|
  CSV.foreach('input.csv', headers: true) do |row|
    next unless row['status'] == 'active'   # skip inactive rows
    row['processed_date'] = Time.now.iso8601
    out << row
  end
end
# Manual row-by-row control; CSV#shift returns the next row (nil at EOF),
# while CSV#read would slurp all remaining rows at once
CSV.open('complex.csv', headers: true) do |csv|
  while (row = csv.shift)
    # Complex business logic here
    process_row(row) if validate_row(row)
  end
end
Complex CSV structures with nested data, multiple header rows, or irregular formats require manual parsing approaches. The CSV library provides row-level access for handling non-standard formats.
# Handling multiple header sections
def parse_complex_csv(filename)
  sections = {}
  current_section = nil

  CSV.foreach(filename) do |row|
    if row[0]&.start_with?('[') && row[0]&.end_with?(']')
      current_section = row[0][1..-2]
      sections[current_section] = []
    elsif current_section && !row.all?(&:nil?)
      sections[current_section] << row
    end
  end

  sections
end
# Processing with row validation; with_index(2) numbers data rows
# starting below the header line
CSV.foreach('data.csv', headers: true, skip_blanks: true).with_index(2) do |row, line_num|
  next unless row.fields.compact.length >= 3

  begin
    validated_row = validate_and_transform(row)
    process_business_logic(validated_row)
  rescue ValidationError => e
    log_error("Row #{line_num}: #{e.message}")
  end
end
The CSV library supports liberal parsing modes for handling malformed data. Liberal parsing attempts to recover from common formatting errors while maintaining data integrity.
# Liberal parsing for malformed CSV
CSV.read('messy.csv',
         headers: true,
         liberal_parsing: true,
         nil_value: '',
         empty_value: nil)
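The effect of liberal_parsing is visible on a one-line example: a quote character inside an unquoted field is a hard error in strict mode but passes through as literal data in liberal mode.

```ruby
require 'csv'

bad = 'one,two"two,three'

# Strict parsing rejects a stray quote in an unquoted field
strict_error = begin
  CSV.parse_line(bad)
  nil
rescue CSV::MalformedCSVError => e
  e
end
puts strict_error&.message

# liberal_parsing keeps the stray quote as literal field data
row = CSV.parse_line(bad, liberal_parsing: true)
puts row.inspect
```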
Error Handling & Debugging
CSV parsing raises specific exceptions for different error conditions. CSV::MalformedCSVError occurs when the parser encounters invalid CSV structure, while encoding errors arise from character set mismatches.
begin
  CSV.read('problematic.csv', headers: true)
rescue CSV::MalformedCSVError => e
  puts "CSV format error at line #{e.line_number}: #{e.message}"
  # Attempt recovery with liberal parsing
  CSV.read('problematic.csv', headers: true, liberal_parsing: true)
rescue Encoding::UndefinedConversionError => e
  puts "Encoding error: #{e.message}"
  # Retry, transcoding from Latin-1 to UTF-8
  CSV.read('problematic.csv', headers: true, encoding: 'ISO-8859-1:UTF-8')
end
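Structural errors can be reproduced deterministically with an unclosed quote; the raised CSV::MalformedCSVError reports the offending line via #line_number:

```ruby
require 'csv'

# An unterminated quoted field is a structural error
error = begin
  CSV.parse("name,notes\nAlice,\"unterminated")
  nil
rescue CSV::MalformedCSVError => e
  e
end

puts error.message      # mentions the offending line
puts error.line_number  # the 1-based line where parsing failed
```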
Field validation requires checking data types, ranges, and business rules during parsing. Implementing validation chains ensures data quality while maintaining processing performance.
class CSVValidator
  def self.validate_row(row, line_number)
    errors = []

    # Required field validation
    %w[name email].each do |field|
      errors << "Missing #{field}" if row[field].nil? || row[field].empty?
    end

    # Data type validation (\A and \z anchor the whole field)
    errors << 'Invalid age format' unless row['age']&.match?(/\A\d+\z/)

    # Business rule validation (nil.to_f is 0.0, so nil passes)
    errors << 'Negative salary not allowed' if row['salary'].to_f.negative?

    raise ValidationError.new(errors.join(', '), line_number) unless errors.empty?
  end
end
# Usage with error logging; with_index(2) accounts for the header row
CSV.foreach('employees.csv', headers: true).with_index(2) do |row, line_num|
  begin
    CSVValidator.validate_row(row, line_num)
    process_employee(row)
  rescue ValidationError => e
    error_log << { line: e.line_number, errors: e.message, data: row.to_h }
  end
end
Encoding detection and conversion handles international character sets and legacy file formats. Ruby's CSV library integrates with the Encoding system for automatic character conversion.
# Detect and convert encoding; invalid bytes can surface as parse
# errors or encoding errors depending on where decoding fails
def safe_csv_read(filename)
  # Try UTF-8 first
  begin
    return CSV.read(filename, headers: true, encoding: 'UTF-8')
  rescue ArgumentError, CSV::MalformedCSVError, Encoding::InvalidByteSequenceError
    # Fall back to common encodings, transcoding to UTF-8
    %w[ISO-8859-1 Windows-1252 UTF-16].each do |encoding|
      begin
        return CSV.read(filename, headers: true, encoding: "#{encoding}:UTF-8")
      rescue ArgumentError, CSV::MalformedCSVError,
             Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
        next
      end
    end
  end

  raise "Unable to determine file encoding"
end
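The "ISO-8859-1:UTF-8" form means "read bytes as ISO-8859-1, then transcode to UTF-8". The same two-step conversion can be demonstrated on a string:

```ruby
require 'csv'

# 0xE9 is "é" in ISO-8859-1 but an invalid byte in UTF-8
latin1 = "caf\xE9,3".dup.force_encoding(Encoding::ISO_8859_1)

# Transcode first, then parse as normal UTF-8 CSV
row = CSV.parse_line(latin1.encode(Encoding::UTF_8))
puts row.inspect
```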
Debugging CSV parsing issues requires examining raw file content, character codes, and parsing state. The CSV library provides introspection methods for troubleshooting.
# Debug CSV parsing issues
def debug_csv_file(filename)
  File.open(filename, 'rb') do |file|
    first_line = file.readline
    puts "Raw bytes: #{first_line.bytes.inspect}"
    puts "Encoding: #{file.external_encoding}"   # binary (ASCII-8BIT) in 'rb' mode
  end

  CSV.open(filename, headers: true) do |csv|
    csv.each_with_index do |row, index|
      puts "Line #{csv.lineno}: #{row.fields.length} fields"
      break if index > 5 # Examine first few rows
    end
  end
end
Performance & Memory
Large CSV file processing requires streaming approaches to avoid memory exhaustion. CSV.foreach processes rows individually without loading entire files, while CSV.filter streams a transformation from input to output.
# Memory-efficient processing of large files
def process_large_csv(filename)
  processed_count = 0
  error_count = 0

  CSV.foreach(filename, headers: true) do |row|
    begin
      yield row
      processed_count += 1
    rescue StandardError
      error_count += 1
      next
    end

    # Progress reporting for long operations
    if processed_count % 10_000 == 0
      puts "Processed #{processed_count} rows, #{error_count} errors"
    end
  end

  { processed: processed_count, errors: error_count }
end
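When rows feed bulk operations (batch inserts, API calls), Enumerable#each_slice groups the foreach stream into batches without losing its constant-memory property. A sketch using a temporary file and a hypothetical in_batches helper:

```ruby
require 'csv'
require 'tempfile'

# Yield rows in fixed-size batches; only one batch is in memory at a time
def in_batches(filename, size: 500, &block)
  CSV.foreach(filename, headers: true).each_slice(size, &block)
end

batch_sizes = []
Tempfile.create(['rows', '.csv']) do |f|
  f.write("id\n" + (1..10).map(&:to_s).join("\n"))
  f.flush
  in_batches(f.path, size: 4) { |batch| batch_sizes << batch.size }
end
puts batch_sizes.inspect
```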
Performance optimization focuses on minimizing string allocations, avoiding unnecessary conversions, and using appropriate parsing options. Disabling unused features improves processing speed.
require 'benchmark'

# Compare parsing approaches
Benchmark.bm(20) do |x|
  x.report('CSV.read') do
    CSV.read('large.csv')
  end

  x.report('CSV.foreach') do
    CSV.foreach('large.csv') { |row| row }
  end

  x.report('minimal parsing') do
    CSV.foreach('large.csv',
                headers: false,
                converters: [],
                skip_blanks: false) { |row| row }
  end
end
Memory usage patterns differ significantly between parsing approaches. Streaming methods maintain constant memory usage regardless of file size, while batch methods scale linearly with data volume.
# Monitor memory usage during processing (get_process_mem gem)
require 'get_process_mem'

def memory_efficient_transform(input_file, output_file)
  start_memory = GetProcessMem.new.mb
  row_count = 0

  CSV.open(output_file, 'w',
           write_headers: true,
           headers: %w[id name processed_date]) do |output|
    CSV.foreach(input_file, headers: true) do |row|
      transformed = transform_row(row)
      output << transformed if transformed
      row_count += 1

      if row_count % 1000 == 0
        current_memory = GetProcessMem.new.mb
        puts "Processed #{row_count} rows, Memory: #{current_memory - start_memory}MB"
      end
    end
  end
end
Parallel processing of CSV data requires careful coordination to maintain row order and handle shared resources. Ruby threads are constrained by the global VM lock for CPU-bound work, so this pattern pays off mainly when row processing is IO-bound (API calls, database writes).
# Parallel CSV processing with a thread pool (Queue is a core class)
def parallel_csv_process(filename, thread_count: 4)
  queue = Queue.new
  results = Queue.new

  # Producer thread reads rows and enqueues them with their index
  producer = Thread.new do
    CSV.foreach(filename, headers: true).with_index do |row, index|
      queue << [row, index]
    end
    thread_count.times { queue << :end }
  end

  # Worker threads transform rows concurrently
  workers = thread_count.times.map do
    Thread.new do
      while (item = queue.pop) != :end
        row, index = item
        processed = expensive_row_processing(row)
        results << [processed, index]
      end
    end
  end

  # Collector drains results without blocking on an empty queue
  all_results = []
  collector = Thread.new do
    while workers.any?(&:alive?) || !results.empty?
      begin
        result, index = results.pop(true)
        all_results << [result, index]
      rescue ThreadError
        sleep 0.01
      end
    end
  end

  [producer, *workers, collector].each(&:join)
  all_results.sort_by(&:last).map(&:first)
end
Reference
Core Methods
Method | Parameters | Returns | Description |
---|---|---|---|
CSV.read(path, **opts) | path (String), options (Hash) | Array of Arrays, or CSV::Table with headers | Reads entire CSV file into memory |
CSV.foreach(path, **opts) | path (String), options (Hash) | nil (Enumerator without a block) | Yields each row to block |
CSV.parse(string, **opts) | string (String), options (Hash) | Array of Arrays, or CSV::Table with headers | Parses CSV string in memory |
CSV.generate(**opts) | options (Hash) | String | Builds CSV string from block |
CSV.open(path, mode, **opts) | path (String), mode (String), options (Hash) | CSV | Opens CSV file for reading/writing |
CSV.filter(input, output, **opts) | input (IO), output (IO), options (Hash) | nil | Streams rows from input to output |
Instance Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#shift | None | Array/CSV::Row or nil | Reads next row (aliases: #gets, #readline) |
#read | None | Array or CSV::Table | Reads all remaining rows |
#each | Block | Enumerator | Iterates over all rows |
#<<(row) | row (Array/CSV::Row) | CSV | Writes row to output |
#flush | None | CSV | Flushes output buffer |
#close | None | nil | Closes CSV file |
#lineno | None | Integer | Current line number |
#rewind | None | CSV | Resets file position |
Parse Options
Option | Type | Default | Description |
---|---|---|---|
:headers | Boolean/Array/String | false | Enable header row processing |
:col_sep | String | "," | Column separator |
:row_sep | String/Symbol | :auto | Row separator sequence |
:quote_char | String | '"' | Quote character for fields |
:field_size_limit | Integer | nil | Maximum field size in characters |
:converters | Array/Symbol | nil | Field value converters |
:unconverted_fields | Boolean | false | Preserve original field values alongside converted ones |
:header_converters | Array/Symbol | nil | Header name converters |
:return_headers | Boolean | false | Include header row in output |
:write_headers | Boolean | false | Write headers before first row |
:skip_blanks | Boolean | false | Skip empty rows |
:skip_lines | Regexp/String | nil | Skip rows matching pattern |
:liberal_parsing | Boolean | false | Tolerate malformed fields |
:nil_value | Object | nil | Value substituted for unquoted empty fields |
:empty_value | Object | "" | Value substituted for quoted empty fields |
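The interplay of :nil_value and :empty_value hinges on the distinction between unquoted and quoted empty fields:

```ruby
require 'csv'

# By default an unquoted empty field parses as nil and a quoted
# empty field ("") parses as the empty string
default_row = CSV.parse_line('a,,""')
custom_row  = CSV.parse_line('a,,""', nil_value: 'N/A', empty_value: nil)

p default_row
p custom_row
```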
Built-in Converters
Converter | Description |
---|---|
:integer | Convert integer fields to Integer objects |
:float | Convert float fields to Float objects |
:numeric | Apply both :integer and :float |
:date | Convert date fields to Date objects |
:date_time | Convert datetime fields to DateTime objects |
:all | Apply all numeric and date converters |
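A quick check of converter behavior: fields that don't match a converter's pattern pass through unchanged as strings.

```ruby
require 'csv'
require 'date'

# :numeric converts integer- and float-shaped fields only
numeric_row = CSV.parse_line('42,3.14,2024-06-01,hello', converters: :numeric)
p numeric_row

# :date additionally recognizes date-shaped fields
date_row = CSV.parse_line('2024-06-01', converters: [:date])
p date_row.first.class
```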
Header Converters
Converter | Description |
---|---|
:downcase | Convert headers to lowercase |
:symbol | Downcase headers, strip punctuation, underscore whitespace, and symbolize |
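The :symbol converter's normalization is visible with multi-word headers:

```ruby
require 'csv'

table = CSV.parse("First Name,Email Address\nAlice,a@example.com",
                  headers: true, header_converters: :symbol)

p table.headers          # normalized header symbols
p table[0][:first_name]  # field access via the converted header
```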
Exception Hierarchy
Exception | Description |
---|---|
CSV::MalformedCSVError | Invalid CSV format detected |
CSV::Row::InvalidIndexError | Invalid row index access |
Encoding::InvalidByteSequenceError | Invalid byte sequence for the encoding |
Encoding::UndefinedConversionError | Character conversion failure |
CSV::Row Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#[] | index (Integer/String) | String or nil | Access field by index or header |
#[]= | index, value | value | Set field by index or header |
#field(name, offset) | name (String), offset (Integer) | String or nil | Get field by name, searching from offset |
#fields(*headers) | headers | Array | Get multiple fields |
#values_at(*indices) | indices | Array | Get fields at indices |
#each | Block | Enumerator | Iterate over [header, value] pairs |
#to_h | None | Hash | Convert to Hash of header => value |
#to_a | None | Array | Convert to Array of [header, value] pairs |
#headers | None | Array | Get header names |
#length | None | Integer | Number of fields |
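Most of these methods can be exercised on a single parsed row:

```ruby
require 'csv'

row = CSV.parse("name,age,city\nAlice,30,Boston", headers: true).first

p row['name']                  # access by header
p row[1]                       # access by position
p row.fields('name', 'city')   # multiple fields by header
p row.to_h                     # header => value Hash
p row.headers                  # header names
```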