Overview
Ruby provides several methods for iterating over strings, organized around four primary modes: character-based, byte-based, line-based, and codepoint-based iteration. Each mode offers both an each_*-style method that yields values to a block and a companion method that returns an array for further processing.
Ruby's string iteration handles encoding automatically, converting between byte sequences and character representations based on the string's encoding. The iteration methods integrate with Ruby's Enumerable module, providing access to familiar methods like map, select, and reduce when working with string data.
The core iteration methods include each_char for Unicode character iteration, each_byte for raw byte access, each_line for text processing, and each_codepoint for Unicode codepoint access. Ruby also provides corresponding methods (chars, bytes, lines, codepoints) that return arrays instead of yielding to blocks.
text = "Hello\nWorld"
text.each_char { |char| puts char.inspect }
# => "H"
# => "e"
# => "l"
# => "l"
# => "o"
# => "\n"
# => "W"
# => "o"
# => "r"
# => "l"
# => "d"
String iteration methods respect the string's encoding and handle multibyte characters correctly. When iterating over strings containing non-ASCII characters, Ruby manages the complexity of variable-width encodings transparently.
japanese = "こんにちは"
japanese.each_char.count # => 5 (Unicode characters)
japanese.each_byte.count # => 15 (UTF-8 bytes)
The iteration API integrates seamlessly with Ruby's block syntax and enumerable operations, making string processing both efficient and expressive for text analysis, data parsing, and content transformation tasks.
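As a quick illustration of that integration, the enumerator returned by each_char feeds ordinary Enumerable chains (a minimal sketch; the sample string is arbitrary, and tally requires Ruby 2.7 or later):
greeting = "Hello World"
# Chain Enumerable methods directly off the character enumerator
greeting.downcase.each_char.reject { |c| c == ' ' }.tally
# => {"h"=>1, "e"=>1, "l"=>3, "o"=>2, "w"=>1, "r"=>1, "d"=>1}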
Basic Usage
Ruby's string iteration methods follow consistent naming patterns and behavior across different iteration modes. The each_* methods yield individual elements to blocks, while the corresponding array methods return collections for further processing.
Character iteration processes strings one Unicode character at a time, handling multibyte encodings automatically. This method works correctly with international text and emoji characters.
text = "Ruby 🚀"
chars = []
text.each_char { |c| chars << c }
chars # => ["R", "u", "b", "y", " ", "🚀"]
# Equivalent using array method
text.chars # => ["R", "u", "b", "y", " ", "🚀"]
Byte iteration provides access to the raw byte values that compose the string, useful for binary data processing, encoding analysis, and low-level string manipulation.
binary_data = "\x89PNG\r\n\x1a\n"
binary_data.each_byte do |byte|
  printf "%02x ", byte
end
# => 89 50 4e 47 0d 0a 1a 0a
# Convert to array for processing
bytes = binary_data.bytes
bytes.map { |b| b.to_s(16) } # => ["89", "50", "4e", "47", "d", "a", "1a", "a"]
Line iteration splits strings on a separator that defaults to "\n". Windows line endings (\r\n) work with the default separator because each line still ends in \n, while classic Mac line endings (\r) are only treated as terminators when "\r" is passed as the separator explicitly.
multiline = "First line\nSecond line\r\nThird line\r"
multiline.each_line do |line|
  puts line.inspect
end
# => "First line\n"
# => "Second line\r\n"
# => "Third line\r"
# Process lines with chomp to remove terminators
multiline.each_line(chomp: true) do |line|
  puts "Line: #{line}"
end
# => Line: First line
# => Line: Second line
# => Line: Third line
Codepoint iteration yields Unicode codepoint values as integers, useful for Unicode analysis, character encoding conversion, and text normalization operations.
text = "café"
text.each_codepoint do |codepoint|
  puts "U+%04X (%s)" % [codepoint, codepoint.chr('UTF-8')]
end
# => U+0063 (c)
# => U+0061 (a)
# => U+0066 (f)
# => U+00E9 (é)
# Extract codepoint ranges
text.codepoints.select { |cp| cp > 127 } # => [233] (é)
The enumerable methods work with all iteration modes, enabling functional programming approaches to string processing.
text = "Hello World"
vowels = text.each_char.select { |c| 'aeiouAEIOU'.include?(c) }
vowels # => ["e", "o", "o"]
# Chain operations for complex processing
line_lengths = text.each_line.map(&:chomp).map(&:length)
consonants = text.chars.reject { |c| 'aeiouAEIOU '.include?(c) }
Performance & Memory
String iteration methods exhibit different performance characteristics depending on the iteration mode, string encoding, and processing requirements. Understanding these differences enables optimal method selection for specific use cases.
Character iteration with each_char must decode each character according to the string's encoding, which makes it slower than byte iteration for simple operations but necessary for correct Unicode handling. The performance impact grows with string length and encoding complexity.
require 'benchmark'
large_text = "a" * 100_000
Benchmark.bm(15) do |x|
  x.report("each_byte:") { large_text.each_byte { |b| b } }
  x.report("each_char:") { large_text.each_char { |c| c } }
  x.report("each_codepoint:") { large_text.each_codepoint { |cp| cp } }
end
# Typical results:
#                       user     system      total        real
# each_byte:        0.005000   0.000000   0.005000 (  0.005234)
# each_char:        0.015000   0.000000   0.015000 (  0.015123)
# each_codepoint:   0.012000   0.000000   0.012000 (  0.012045)
Array-returning methods (chars, bytes, lines, codepoints) create intermediate arrays that consume additional memory proportional to the string size. For large strings, this memory overhead can be significant and may trigger garbage collection.
# Memory-efficient iteration (yields values, no intermediate array)
large_string = "content" * 50_000
processed_count = 0
large_string.each_char { |c| processed_count += 1 if c.match?(/[aeiou]/) }
# Memory-intensive approach (creates intermediate array)
vowel_chars = large_string.chars.select { |c| c.match?(/[aeiou]/) }
vowel_count = vowel_chars.length
Line iteration performance depends heavily on line terminator distribution and line length variance. Ruby optimizes line splitting for common patterns but can be slower with irregular line structures.
# Efficient line processing for uniform data
log_data = ("timestamp,event,data\n" * 10_000)
log_data.each_line(chomp: true) do |line|
  # Process individual lines without storing all in memory
  fields = line.split(',')
end
# Less efficient: loads all lines into memory first
all_lines = log_data.lines(chomp: true)
processed_lines = all_lines.map { |line| line.split(',') }
Encoding overhead affects iteration performance differently across methods. ASCII-only strings iterate fastest, UTF-8 adds moderate overhead for multibyte characters, and non-ASCII-compatible encodings such as UTF-16 and UTF-32 are slower still because Ruby's string internals are optimized for ASCII-compatible encodings.
ascii_string = "simple text" * 10_000
utf8_string = "mixed text 中文" * 10_000
utf16_string = utf8_string.encode('UTF-16LE') # use UTF-16LE; plain 'UTF-16' is a dummy encoding in Ruby
# Performance varies significantly by encoding
Benchmark.bm(20) do |x|
  x.report("ASCII each_char:") { ascii_string.each_char { |c| c } }
  x.report("UTF-8 each_char:") { utf8_string.each_char { |c| c } }
  x.report("UTF-16 each_char:") { utf16_string.each_char { |c| c } }
end
For high-performance scenarios, consider processing string data in bounded chunks rather than materializing one huge character array. Each slice keeps temporary objects small and cache friendly, and block-based iteration inside each chunk avoids intermediate arrays entirely.
def process_large_string_efficiently(text)
  # Process in chunks to keep temporary objects small
  chunk_size = 1024
  (0...text.length).step(chunk_size) do |offset|
    chunk = text[offset, chunk_size]
    chunk.each_char { |char| yield char }
  end
end
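A usage sketch for the helper above (the sample string and the resulting count are illustrative):
sample = "chunked processing example " * 2_000
vowel_count = 0
process_large_string_efficiently(sample) do |char|
  vowel_count += 1 if "aeiou".include?(char)
end
vowel_count # => 16000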
Memory usage patterns vary significantly between iteration approaches. Block-based iteration maintains constant memory usage regardless of string size, while array-based methods scale linearly with input size and may cause memory pressure in resource-constrained environments.
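One way to see the difference is to pull only a few items from a very long string: the lazy enumerator below reads just what it needs, while the array form materializes every line first (a sketch with synthetic data):
report = "value\n" * 200_000
# Constant memory: lazily take the first ten lines off the enumerator
first_ten = report.each_line.lazy.map(&:chomp).first(10)
# Linear memory: builds a 200,000-element array before taking ten
first_ten = report.lines.map(&:chomp).first(10)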
Common Pitfalls
String iteration methods contain subtle behaviors that frequently cause confusion and bugs in Ruby programs. Understanding these edge cases prevents common mistakes and ensures reliable string processing.
The distinction between characters, bytes, and codepoints becomes critical when working with multibyte encodings. Developers often assume character count equals byte count, leading to incorrect buffer sizing and indexing errors.
text = "café"
puts "String: #{text}"
puts "Length: #{text.length}" # => 4 (characters)
puts "Bytesize: #{text.bytesize}" # => 5 (bytes, é = 2 bytes in UTF-8)
puts "Chars: #{text.chars.length}" # => 4 (same as length)
puts "Bytes: #{text.bytes.length}" # => 5 (actual byte count)
# Common mistake: treating a byte offset as a character offset
byte_index = 3                 # byte 3 is the first byte of "é"
text.byteslice(byte_index, 1)  # => "\xC3" (half of a multibyte character)
text[byte_index]               # => "é" (String#[] indexes by character, not byte)
text.chars[3]                  # => "é" (equivalent, via an explicit character array)
Line iteration behavior with line terminators often surprises developers. Ruby preserves line terminators in the yielded strings by default, and the chomp option affects whether terminators are removed.
text_with_lines = "first\nsecond\nthird"
# Lines include terminators by default
text_with_lines.each_line { |line| puts line.inspect }
# => "first\n"
# => "second\n"
# => "third"
# Empty line handling can be unexpected
text_with_empty = "line1\n\nline3\n"
lines = text_with_empty.lines
lines # => ["line1\n", "\n", "line3\n"]
lines.length # => 3 (not 2!)
# Use chomp to normalize line endings
normalized_lines = text_with_empty.lines(chomp: true).reject(&:empty?)
normalized_lines # => ["line1", "line3"]
Encoding mismatches during iteration can cause Encoding::CompatibilityError exceptions, especially when combining strings from different sources or processing binary data as text.
# Compatible encodings combine cleanly; incompatible ones raise
utf8_string = "héllo"                    # UTF-8 with a non-ASCII character
ascii_string = "world".encode('US-ASCII')
begin
  combined = utf8_string + ascii_string  # works: US-ASCII is compatible with UTF-8
  combined + "\xFF\xFE".b                # raises: binary string with high bytes
rescue Encoding::CompatibilityError => e
  puts "Encoding error: #{e.message}"
end
# Real problem: binary data scanned as text
binary_data = "\x89\xFF\x00\x42".force_encoding('UTF-8') # invalid UTF-8
begin
  binary_data.each_char { |char| puts char if char.match?(/\w/) }
rescue ArgumentError => e
  puts "Invalid byte sequence: #{e.message}"
  # Solution: use each_byte for binary data
  binary_data.each_byte { |byte| printf "%02x ", byte }
end
Performance assumptions about iteration methods often lead to inefficient code. Developers frequently choose array-returning methods when streaming iteration would be more appropriate, causing unnecessary memory allocation.
# Inefficient: reads the whole file, then builds an intermediate array of lines
large_log = File.read('huge_log.txt')
error_lines = large_log.lines.select { |line| line.include?('ERROR') }
# Efficient: streams line by line without intermediate storage
error_lines = []
File.foreach('huge_log.txt') do |line|
  error_lines << line if line.include?('ERROR')
end
# Even better: process immediately without storing matches
error_count = 0
File.foreach('huge_log.txt') do |line|
  error_count += 1 if line.include?('ERROR')
end
Modification of strings during iteration creates undefined behavior and can lead to infinite loops or corrupted data. Ruby strings are mutable, but modifying them during iteration breaks iterator state.
text = "abcdef"
# Dangerous: modifying string during iteration
# text.each_char.with_index do |char, index|
# text[index] = char.upcase if char == 'c' # Don't do this!
# end
# Safe: build new string or collect changes
chars = text.chars
chars.map! { |char| char == 'c' ? char.upcase : char }
result = chars.join # => "abCdef"
# Alternative: use gsub for character replacement
result = text.gsub('c', 'C') # => "abCdef"
Unicode normalization issues arise when comparing or processing text that looks identical but uses different codepoint sequences, such as precomposed versus decomposed accented characters. Character iteration yields different results for these visually identical strings.
# These look the same but are different Unicode sequences
cafe1 = "café" # precomposed é (U+00E9)
cafe2 = "cafe" + "\u0301" # e + combining acute accent (U+0301)
puts cafe1 == cafe2 # => false
puts cafe1.length # => 4
puts cafe2.length # => 5
# Normalize for comparison (String#unicode_normalize is built in since Ruby 2.2)
puts cafe1.unicode_normalize == cafe2.unicode_normalize # => true
# Character iteration reveals the difference
cafe1.each_char { |c| puts c.inspect } # => "c", "a", "f", "é"
cafe2.each_char { |c| puts c.inspect } # => "c", "a", "f", "e", "́"
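Normalizing before iterating makes the two spellings yield the same character sequence (NFC composes, NFD decomposes):
cafe2.unicode_normalize(:nfc).chars # => ["c", "a", "f", "é"]
cafe1.unicode_normalize(:nfd).chars # => ["c", "a", "f", "e", "́"]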
Reference
Core Iteration Methods
Method | Parameters | Returns | Description |
---|---|---|---|
#each_char | &block | String or Enumerator | Yields each Unicode character to block |
#each_byte | &block | String or Enumerator | Yields each byte value as integer to block |
#each_line | separator=$/, chomp: false, &block | String or Enumerator | Yields each line to block |
#each_codepoint | &block | String or Enumerator | Yields each Unicode codepoint as integer to block |
#chars | None | Array | Returns array of Unicode characters |
#bytes | None | Array | Returns array of byte values |
#lines | separator=$/, chomp: false | Array | Returns array of lines |
#codepoints | None | Array | Returns array of Unicode codepoints |
Line Separator Options
Separator | Behavior | Example |
---|---|---|
"\n" | Unix line endings (default) | "a\nb".each_line |
"\r\n" | Windows line endings | "a\r\nb".each_line("\r\n") |
"\r" | Classic Mac line endings | "a\rb".each_line("\r") |
"" (empty string) | Paragraph mode (blank line separated) | "a\n\nb".each_line("") |
nil | Entire string as single line | "a\nb".each_line(nil) |
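Paragraph mode is the least intuitive entry in the table: it splits on runs of blank lines and keeps the trailing newlines with each paragraph.
"hello\nworld\n\ngoodbye\n".each_line("") { |para| puts para.inspect }
# => "hello\nworld\n\n"
# => "goodbye\n"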
Enumerable Integration
All iteration methods return Enumerators when called without blocks, providing full Enumerable method access:
string = "Hello World"
# Mapping operations
string.each_char.map(&:upcase) # => ["H", "E", "L", "L", "O", " ", "W", "O", "R", "L", "D"]
string.each_byte.map { |b| b.to_s(16) } # => ["48", "65", "6c", "6c", "6f", "20", "57", "6f", "72", "6c", "64"]
# Filtering operations
string.each_char.select { |c| c.match?(/[aeiou]/i) } # => ["e", "o", "o"]
string.each_line.reject(&:empty?) # => ["Hello World"]
# Aggregation operations
string.each_char.count { |c| c.match?(/[A-Z]/) } # => 2
string.each_byte.reduce(:+) # => 1052
Performance Characteristics
Method | Memory Usage | Speed | Best For |
---|---|---|---|
each_char | Constant | Moderate | Unicode text processing |
each_byte | Constant | Fast | Binary data, ASCII text |
each_line | Constant | Variable | Text file processing |
each_codepoint | Constant | Moderate | Unicode analysis |
chars | O(n) | Moderate | Array operations on characters |
bytes | O(n) | Fast | Array operations on bytes |
lines | O(n) | Variable | Array operations on lines |
codepoints | O(n) | Moderate | Array operations on codepoints |
Encoding Considerations
Encoding | Character Iteration | Bytes per Character | Notes |
---|---|---|---|
ASCII | Fast | 1 | Single byte per character |
UTF-8 | Moderate | 1 to 4 | Variable-width encoding |
UTF-16 | Slower | 2 or 4 | Mostly fixed width, with surrogate pairs |
UTF-32 | Slower | 4 | Fixed-width encoding |
Binary (ASCII-8BIT) | N/A | 1 | Use byte iteration only |
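A quick way to check the bytes-per-character ratio for any string (the sample strings are arbitrary):
["abc", "café", "こんにちは"].each do |sample|
  puts "#{sample}: #{sample.length} chars, #{sample.bytesize} bytes"
end
# => abc: 3 chars, 3 bytes
# => café: 4 chars, 5 bytes
# => こんにちは: 5 chars, 15 bytes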
Common Error Types
Error | Cause | Solution |
---|---|---|
ArgumentError | Invalid byte sequence for the string's encoding | Check the encoding; use each_byte for binary data |
Encoding::CompatibilityError | Mixing incompatible encodings | Convert encodings before processing |
NoMethodError | Calling an Array-only method on an Enumerator | Convert with to_a, or use the array-returning method (chars, bytes, lines, codepoints) |
FrozenError | Modifying a frozen string during iteration | Build a new string, or dup before modifying |
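A minimal sketch of the FrozenError row: in-place edits on a frozen string raise, so build a new string instead (Ruby 2.5+ raises FrozenError; earlier versions raise RuntimeError):
frozen = "hello".freeze
begin
  frozen.each_char.with_index { |char, i| frozen[i] = char.upcase }
rescue FrozenError => e
  puts "Cannot modify: #{e.message}"
  upcased = frozen.each_char.map(&:upcase).join # => "HELLO"
end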
Line Ending Detection
def detect_line_endings(text)
  crlf = text.scan("\r\n").length
  lf = text.count("\n") - crlf   # bare \n, not part of \r\n
  cr = text.count("\r") - crlf   # bare \r, not part of \r\n
  return :none if crlf + lf + cr == 0
  return :mixed if [crlf, lf, cr].count(&:positive?) > 1
  return :windows if crlf > 0
  return :mac if cr > 0
  :unix
end
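Example calls against the helper above:
detect_line_endings("a\r\nb\r\n")      # => :windows
detect_line_endings("a\nb\r\nc")       # => :mixed
detect_line_endings("no terminators")  # => :none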
Unicode Categories
# Common Unicode character categories for filtering
def categorize_chars(text)
  text.each_char.group_by do |char|
    case char
    when /\p{L}/ then :letter
    when /\p{N}/ then :number
    when /\p{P}/ then :punctuation
    when /\p{S}/ then :symbol
    when /\p{Z}/ then :separator
    when /\p{C}/ then :control
    else :other
    end
  end
end
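Example call showing the grouping (hash keys appear in order of first occurrence):
categorize_chars("Ruby 3.3!")
# => {:letter=>["R", "u", "b", "y"], :separator=>[" "], :number=>["3", "3"], :punctuation=>[".", "!"]}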