String Iteration

A comprehensive guide to iterating over Ruby strings using characters, bytes, lines, and codepoints with performance considerations and common pitfalls.

Overview

Ruby provides multiple methods for iterating over strings, each designed for specific use cases and data requirements. The string iteration API centers around four primary iteration modes: character-based, byte-based, line-based, and codepoint-based iteration. Each mode offers both enumerable methods that yield values to blocks and methods that return arrays for further processing.

Ruby's string iteration is encoding-aware: the character and codepoint methods decode the underlying bytes according to the string's encoding, while the byte methods expose those bytes unchanged. String itself does not include Enumerable, but every each_* method returns an Enumerator when called without a block, which gives access to familiar methods like map, select, and reduce when working with string data.

The core iteration methods include each_char for Unicode character iteration, each_byte for raw byte access, each_line for text processing, and each_codepoint for Unicode codepoint access. Ruby also provides corresponding methods (chars, bytes, lines, codepoints) that return arrays instead of yielding to blocks.

text = "Hello\nWorld"
text.each_char { |char| puts char.inspect }
# => "H"
# => "e"
# => "l"
# => "l"
# => "o"
# => "\n" 
# => "W"
# => "o"
# => "r"
# => "l"
# => "d"

String iteration methods respect the string's encoding and handle multibyte characters correctly. When iterating over strings containing non-ASCII characters, Ruby manages the complexity of variable-width encodings transparently.

japanese = "こんにちは"
japanese.each_char.count  # => 5 (Unicode characters)
japanese.each_byte.count  # => 15 (UTF-8 bytes)

The iteration API integrates seamlessly with Ruby's block syntax and enumerable operations, making string processing both efficient and expressive for text analysis, data parsing, and content transformation tasks.

Basic Usage

Ruby's string iteration methods follow consistent naming patterns and behavior across different iteration modes. The each_* methods yield individual elements to blocks, while the corresponding array methods return collections for further processing.
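
As a quick illustration of that naming pattern, the following sketch compares what each calling form returns. The variable names are chosen only for this example.

```ruby
text = "Ruby"

# With a block, each_char yields every character and returns the receiver
returned = text.each_char { |c| c }

# Without a block, each_char returns an Enumerator that can be chained
enum = text.each_char

# The array form builds the whole collection up front
array = text.chars

puts returned.equal?(text)  # true: the same String object
puts enum.class             # Enumerator
puts array.inspect          # ["R", "u", "b", "y"]
```

The same three forms exist for bytes, lines, and codepoints.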

Character iteration processes strings one Unicode character at a time, handling multibyte encodings automatically. This method works correctly with international text and emoji characters.

text = "Ruby 🚀"
chars = []
text.each_char { |c| chars << c }
chars  # => ["R", "u", "b", "y", " ", "🚀"]

# Equivalent using array method
text.chars  # => ["R", "u", "b", "y", " ", "🚀"]

Byte iteration provides access to the raw byte values that compose the string, useful for binary data processing, encoding analysis, and low-level string manipulation.

binary_data = "\x89PNG\r\n\x1a\n"
binary_data.each_byte do |byte|
  printf "%02x ", byte
end
# => 89 50 4e 47 0d 0a 1a 0a

# Convert to array for processing
bytes = binary_data.bytes
bytes.map { |b| b.to_s(16) }  # => ["89", "50", "4e", "47", "d", "a", "1a", "a"]

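Line iteration splits strings on a separator, which defaults to $/ ("\n"). A Windows-style line therefore arrives ending in "\r\n", and a lone classic Mac "\r" is not treated as a line break unless it is passed as the separator. In the example below every split happens at a "\n"; the final segment simply keeps its trailing "\r".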

multiline = "First line\nSecond line\r\nThird line\r"
multiline.each_line do |line|
  puts line.inspect
end
# => "First line\n"
# => "Second line\r\n" 
# => "Third line\r"

# Process lines with chomp to remove terminators
multiline.each_line(chomp: true) do |line|
  puts "Line: #{line}"
end
# => Line: First line
# => Line: Second line
# => Line: Third line
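
Because each_line splits on "\n" by default, text that uses only "\r" terminators needs either an explicit separator or a normalization step. The sketch below shows both; normalized_lines is a hypothetical helper name, not part of Ruby.

```ruby
mac_text = "one\rtwo\rthree"

# The default separator is "\n", so this is seen as a single line
default_split = mac_text.each_line.to_a

# Passing "\r" as the separator splits on classic Mac endings
cr_split = mac_text.each_line("\r").to_a

# Hypothetical helper: convert "\r\n" and lone "\r" to "\n" before splitting
def normalized_lines(text)
  text.gsub(/\r\n?/, "\n").lines(chomp: true)
end

mixed = "First line\nSecond line\r\nThird line\r"
p default_split            # ["one\rtwo\rthree"]
p cr_split                 # ["one\r", "two\r", "three"]
p normalized_lines(mixed)  # ["First line", "Second line", "Third line"]
```

Normalizing first costs one extra copy of the string, which is usually acceptable unless the input is very large.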

Codepoint iteration yields Unicode codepoint values as integers, useful for Unicode analysis, character encoding conversion, and text normalization operations.

text = "café"
text.each_codepoint do |codepoint|
  puts "U+%04X (%s)" % [codepoint, codepoint.chr('UTF-8')]
end
# => U+0063 (c)
# => U+0061 (a)  
# => U+0066 (f)
# => U+00E9 (é)

# Extract codepoint ranges
text.codepoints.select { |cp| cp > 127 }  # => [233] (é)
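
Codepoints can also be turned back into a string. The sketch below, which assumes UTF-8 text, rebuilds a string with Array#pack and the "U*" directive and applies a simple transformation by codepoint arithmetic.

```ruby
text = "café"

codepoints = text.codepoints     # [99, 97, 102, 233]
rebuilt = codepoints.pack("U*")  # the "U" directive produces a UTF-8 string

# Uppercase only ASCII lowercase letters (codepoints 97..122)
shouted = text.each_codepoint.map { |cp|
  cp.between?(97, 122) ? cp - 32 : cp
}.pack("U*")

puts rebuilt == text  # true
puts shouted          # CAFé
```

For real case conversion String#upcase is the better tool; the arithmetic here only demonstrates that codepoints are plain integers.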

The enumerable methods work with all iteration modes, enabling functional programming approaches to string processing.

text = "Hello World"
vowels = text.each_char.select { |c| 'aeiouAEIOU'.include?(c) }
vowels  # => ["e", "o", "o"]

# Chain operations for complex processing
line_lengths = text.each_line.map { |line| line.chomp.length }  # => [11]
consonants = text.chars.reject { |c| 'aeiouAEIOU '.include?(c) }
# => ["H", "l", "l", "W", "r", "l", "d"]

Performance & Memory

String iteration methods exhibit different performance characteristics depending on the iteration mode, string encoding, and processing requirements. Understanding these differences enables optimal method selection for specific use cases.

Character iteration with each_char must locate character boundaries according to the string's encoding and allocates a new one-character String for every element. This makes it slower than byte iteration for simple operations, but it is necessary for correct Unicode handling. The cost grows with string length and, for multibyte text, with the work needed to decode each character.

require 'benchmark'

large_text = "a" * 100_000
unicode_text = "café" * 25_000  # multibyte variant: swap it in for large_text to compare

Benchmark.bm(15) do |x|
  x.report("each_byte:")     { large_text.each_byte { |b| b } }
  x.report("each_char:")     { large_text.each_char { |c| c } }
  x.report("each_codepoint:") { large_text.each_codepoint { |cp| cp } }
end

# Illustrative results (timings vary by machine and Ruby version):
#                      user     system      total        real
# each_byte:       0.005000   0.000000   0.005000 (  0.005234)
# each_char:       0.015000   0.000000   0.015000 (  0.015123)
# each_codepoint:  0.012000   0.000000   0.012000 (  0.012045)

Array-returning methods (chars, bytes, lines, codepoints) create intermediate arrays that consume additional memory proportional to the string size. For large strings, this memory overhead can be significant and may trigger garbage collection.

# Memory-efficient iteration (yields values, no intermediate array)
large_string = "content" * 50_000
processed_count = 0
large_string.each_char { |c| processed_count += 1 if c.match?(/[aeiou]/) }

# Memory-intensive approach (creates intermediate array)  
vowel_chars = large_string.chars.select { |c| c.match?(/[aeiou]/) }
vowel_count = vowel_chars.length
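
The same count can be obtained in several ways with different allocation behavior. This is a minimal sketch on a smaller input; the variable names are illustrative only.

```ruby
text = "content" * 1_000

# Streaming: Enumerator#count with a block keeps only a running total
streamed = text.each_char.count { |c| "aeiou".include?(c) }

# Array-based: chars materializes every character before filtering
materialized = text.chars.select { |c| "aeiou".include?(c) }.length

# String#count does the same job without yielding per-character strings
builtin = text.count("aeiou")

puts streamed      # 2000
puts materialized  # 2000
puts builtin       # 2000
```

When a built-in such as String#count covers the task, it is usually the simplest choice.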

Line iteration cost depends mainly on how many lines the string contains and how long they are, since every line is yielded as a newly allocated string. each_line hands those strings over one at a time, whereas lines keeps all of them alive in a single array.

# Efficient line processing for uniform data
log_data = ("timestamp,event,data\n" * 10_000)
log_data.each_line(chomp: true) do |line|
  # Process individual lines without storing all in memory
  fields = line.split(',')
end

# Less efficient: loads all lines into memory first
all_lines = log_data.lines(chomp: true)
processed_lines = all_lines.map { |line| line.split(',') }

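Encoding affects iteration cost. Strings that contain only ASCII characters are the cheapest case, because every byte is a character. UTF-8 text with multibyte characters needs extra work to find character boundaries (1 to 4 bytes each), as does UTF-16 (2 or 4 bytes each); UTF-32 is fixed at 4 bytes per character but uses more memory. The size of these differences depends on the Ruby version and the data, so measure with representative input.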

ascii_string = "simple text" * 10_000
utf8_string = "mixed text 中文" * 10_000  
utf16_string = utf8_string.encode('UTF-16LE')  # plain 'UTF-16' is a dummy encoding in Ruby and prepends a BOM

# Performance varies significantly by encoding
Benchmark.bm(20) do |x|
  x.report("ASCII each_char:")   { ascii_string.each_char { |c| c } }
  x.report("UTF-8 each_char:")   { utf8_string.each_char { |c| c } }  
  x.report("UTF-16 each_char:")  { utf16_string.each_char { |c| c } }
end

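For performance-sensitive code, one option is to process string data in chunks rather than in a single pass. Treat this as something to benchmark rather than a guaranteed win: slicing by character index requires a scan on non-ASCII strings, and in the version below every character is still yielded individually.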

def process_large_string_efficiently(text)
  return enum_for(__method__, text) unless block_given?

  # Process in chunks of chunk_size characters
  chunk_size = 1024
  (0...text.length).step(chunk_size) do |offset|
    chunk = text[offset, chunk_size]
    chunk.each_char { |char| yield char }
  end
end
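
An alternative way to batch characters, sketched here under the assumption that the caller wants fixed-size groups of characters, is to combine each_char with each_slice. This avoids indexing into the string by offset. each_char_batch is a hypothetical helper name.

```ruby
# Hypothetical helper: yields arrays of up to batch_size characters
def each_char_batch(text, batch_size = 1024)
  text.each_char.each_slice(batch_size) { |batch| yield batch }
end

batches = []
each_char_batch("héllo wörld", 4) { |batch| batches << batch.join }

p batches  # ["héll", "o wö", "rld"]
```

Each batch is an array of one-character strings, so this trades some allocation for simpler per-batch logic.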

Memory usage patterns vary significantly between iteration approaches. Block-based iteration maintains constant memory usage regardless of string size, while array-based methods scale linearly with input size and may cause memory pressure in resource-constrained environments.

Common Pitfalls

String iteration methods contain subtle behaviors that frequently cause confusion and bugs in Ruby programs. Understanding these edge cases prevents common mistakes and ensures reliable string processing.

The distinction between characters, bytes, and codepoints becomes critical when working with multibyte encodings. Developers often assume character count equals byte count, leading to incorrect buffer sizing and indexing errors.

text = "café"
puts "String: #{text}"
puts "Length: #{text.length}"           # => 4 (characters)
puts "Bytesize: #{text.bytesize}"      # => 5 (bytes, é = 2 bytes in UTF-8)
puts "Chars: #{text.chars.length}"     # => 4 (same as length)
puts "Bytes: #{text.bytes.length}"     # => 5 (actual byte count)

# Common mistake: treating a byte offset as a character index
# String#[] indexes by character; byteslice indexes by byte
byte_offset = 3
text.byteslice(byte_offset)  # => "\xC3" (first half of é, an invalid fragment)
text[3]                      # => "é" (character index, always a whole character)
text.chars[3]                # => "é" (same result via the array method)
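
A typical place where the byte/character distinction matters is truncating text to a byte budget, for example a fixed-width database column. The sketch below uses a hypothetical helper, truncate_to_bytes, and assumes the input is valid UTF-8 to begin with.

```ruby
# Hypothetical helper: keep at most max_bytes bytes without leaving a partial character
def truncate_to_bytes(text, max_bytes)
  return text if text.bytesize <= max_bytes

  # byteslice may cut through a multibyte character; scrub("") drops the broken tail
  text.byteslice(0, max_bytes).scrub("")
end

puts truncate_to_bytes("café", 4)  # caf
puts truncate_to_bytes("café", 5)  # café
```

Note that scrub would also remove any invalid bytes that were already present in the input, which is why the assumption of valid input matters here.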

Line iteration behavior with line terminators often surprises developers. Ruby preserves line terminators in the yielded strings by default, and the chomp option affects whether terminators are removed.

text_with_lines = "first\nsecond\nthird"

# Lines include terminators by default
text_with_lines.each_line { |line| puts line.inspect }
# => "first\n"
# => "second\n" 
# => "third"

# Empty line handling can be unexpected
text_with_empty = "line1\n\nline3\n"
lines = text_with_empty.lines
lines  # => ["line1\n", "\n", "line3\n"]
lines.length  # => 3 (not 2!)

# Use chomp to normalize line endings
normalized_lines = text_with_empty.lines(chomp: true).reject(&:empty?)
normalized_lines  # => ["line1", "line3"]

Encoding problems surface in two ways. Combining strings whose encodings are incompatible raises Encoding::CompatibilityError. Treating binary data as text produces invalid byte sequences: each_char yields them as broken single-byte strings, while methods that need valid text (such as regular expression matching) raise ArgumentError.

# Compatible: ASCII-only strings combine freely with UTF-8
utf8_string = "hello".encode('UTF-8')
ascii_string = "world".encode('US-ASCII')
result = utf8_string.chars + ascii_string.chars  # arrays concatenate without error

# Problematic: joining non-ASCII text in incompatible encodings
latin1_string = "café".encode('ISO-8859-1')
begin
  "café" + latin1_string
rescue Encoding::CompatibilityError => e
  puts "Encoding error: #{e.message}"
end

# Real problem: binary data treated as text
binary_data = "\x89\xFF\x00\x42"
if binary_data.valid_encoding?
  binary_data.each_char { |char| puts char }
else
  # each_char would yield broken one-byte strings here rather than raise,
  # so check validity first and use each_byte for binary data
  binary_data.each_byte { |byte| printf "%02x ", byte }
end
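
When input of uncertain origin has to be treated as text anyway, one possible approach is to replace invalid bytes before iterating. The following is a minimal sketch; text_chars and its replacement parameter are hypothetical names.

```ruby
# Hypothetical helper: returns characters that are safe to treat as text
def text_chars(data, replacement = "?")
  data = data.scrub(replacement) unless data.valid_encoding?
  data.chars
end

raw = "ok\xFFgo"
cleaned = text_chars(raw)

puts raw.valid_encoding?              # false
puts cleaned.all?(&:valid_encoding?)  # true
puts cleaned.join                     # ok?go
```

Scrubbing loses information, so it suits display and logging better than round-tripping data.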

Performance assumptions about iteration methods often lead to inefficient code. Developers frequently choose array-returning methods when streaming iteration would be more appropriate, causing unnecessary memory allocation.

# Inefficient: creates intermediate array for large datasets
large_log = File.read('huge_log.txt')
error_lines = large_log.lines.select { |line| line.include?('ERROR') }

# Efficient: streams the file line by line and closes it when done
error_lines = []
File.foreach('huge_log.txt') do |line|
  error_lines << line if line.include?('ERROR')
end

# Even better: process immediately without storage
error_count = 0
File.foreach('huge_log.txt') do |line|
  error_count += 1 if line.include?('ERROR')
end
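
For data that is already in a string, a lazy enumerator gives a similar streaming effect and can stop early. This sketch assumes only the first few matches are needed; the log contents are made up for illustration.

```ruby
log = "INFO boot\nERROR disk full\nINFO ok\nERROR timeout\nERROR retry\n"

# each_line without a block returns an Enumerator; lazy stops as soon as
# enough matches are found instead of scanning every line
first_two_errors = log.each_line
                      .lazy
                      .map(&:chomp)
                      .select { |line| line.start_with?("ERROR") }
                      .first(2)

p first_two_errors  # ["ERROR disk full", "ERROR timeout"]
```

Without lazy, map and select would each build a full intermediate array before first is applied.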

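Modifying a string while iterating over it is error-prone. Depending on the method and the Ruby implementation, the iteration may run over a snapshot or over the live string, so the outcome is hard to predict. Ruby strings are mutable, but it is safer to build a new string or collect the changes and apply them afterwards.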

text = "abcdef"

# Dangerous: modifying string during iteration
# text.each_char.with_index do |char, index|
#   text[index] = char.upcase if char == 'c'  # Don't do this!
# end

# Safe: build new string or collect changes
chars = text.chars
chars.map! { |char| char == 'c' ? char.upcase : char }
result = chars.join  # => "abCdef"

# Alternative: use gsub for character replacement
result = text.gsub('c', 'C')  # => "abCdef"

Unicode normalization issues arise when comparing or processing text that appears identical but is built from different codepoint sequences (different normalization forms). Character iteration may yield different results for visually identical strings.

# These look the same but are different Unicode sequences
cafe1 = "café"                    # precomposed é (U+00E9)
cafe2 = "cafe" + "\u0301"        # e + combining acute accent (U+0301)

puts cafe1 == cafe2               # => false
puts cafe1.length                 # => 4  
puts cafe2.length                 # => 5

# Normalize for comparison (unicode_normalize is a core String method and defaults to NFC)
puts cafe1.unicode_normalize == cafe2.unicode_normalize  # => true

# Character iteration reveals the difference
cafe1.each_char { |c| puts c.inspect }  # => "c", "a", "f", "é"
cafe2.each_char { |c| puts c.inspect }  # => "c", "a", "f", "e", "́"
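
When the goal is to iterate over what a reader perceives as characters, grapheme cluster iteration may be a better fit than each_char. The sketch below assumes Ruby 2.5 or later, where String#grapheme_clusters is available.

```ruby
composed   = "caf\u00E9"   # é as a single codepoint
decomposed = "cafe\u0301"  # e followed by a combining acute accent

puts decomposed.chars.length              # 5
puts decomposed.grapheme_clusters.length  # 4
puts decomposed.unicode_normalize(:nfc) == composed  # true
```

each_grapheme_cluster is the block-yielding counterpart and follows the same each_* conventions as the other iteration methods.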

Reference

Core Iteration Methods

Method | Parameters | Returns | Description
--- | --- | --- | ---
#each_char | &block | String or Enumerator | Yields each Unicode character to block
#each_byte | &block | String or Enumerator | Yields each byte value as integer to block
#each_line | separator=$/, chomp: false, &block | String or Enumerator | Yields each line to block
#each_codepoint | &block | String or Enumerator | Yields each Unicode codepoint as integer to block
#chars | None | Array | Returns array of Unicode characters
#bytes | None | Array | Returns array of byte values
#lines | separator=$/, chomp: false | Array | Returns array of lines
#codepoints | None | Array | Returns array of Unicode codepoints

Line Separator Options

Separator | Behavior | Example
--- | --- | ---
"\n" | Unix line endings (default) | "a\nb".each_line
"\r\n" | Windows line endings | "a\r\nb".each_line("\r\n")
"\r" | Classic Mac line endings | "a\rb".each_line("\r")
"" (empty string) | Paragraph mode (blank line separated) | "a\n\nb".each_line("")
nil | Entire string as single line | "a\nb".each_line(nil)
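
To make the separator modes concrete, this short sketch runs three of them against the same text. The sample text is invented for illustration.

```ruby
text = "para one, line 1\npara one, line 2\n\npara two\n"

by_line      = text.each_line.to_a       # default "\n" separator
by_paragraph = text.each_line("").to_a   # paragraph mode
whole        = text.each_line(nil).to_a  # no splitting

puts by_line.length       # 4
puts by_paragraph.length  # 2
puts whole.length         # 1
```

In paragraph mode the blank line stays attached to the end of the first paragraph.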

Enumerable Integration

All each_* methods return an Enumerator when called without a block, providing full Enumerable method access:

string = "Hello World"

# Mapping operations
string.each_char.map(&:upcase)           # => ["H", "E", "L", "L", "O", " ", "W", "O", "R", "L", "D"]
string.each_byte.map { |b| b.to_s(16) }  # => ["48", "65", "6c", "6c", "6f", "20", "57", "6f", "72", "6c", "64"]

# Filtering operations  
string.each_char.select { |c| c.match?(/[aeiou]/i) }  # => ["e", "o", "o"]
string.each_line.reject(&:empty?)                      # => ["Hello World"]

# Aggregation operations
string.each_char.count { |c| c.match?(/[A-Z]/) }      # => 2
string.each_byte.reduce(:+)                           # => 1052
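
Two further Enumerable combinations that come up often in text analysis are frequency counting and sliding windows. This is a minimal sketch using each_with_object and each_cons; the variable names are illustrative.

```ruby
text = "hello world"

# Count letters by accumulating into a hash with a default of 0
frequency = text.each_char
                .reject { |c| c == " " }
                .each_with_object(Hash.new(0)) { |c, counts| counts[c] += 1 }

# Adjacent character pairs via each_cons
pairs = "ruby".each_char.each_cons(2).map(&:join)

p frequency["l"]  # 3
p pairs           # ["ru", "ub", "by"]
```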

Performance Characteristics

Method | Memory Usage | Speed | Best For
--- | --- | --- | ---
each_char | Constant | Moderate | Unicode text processing
each_byte | Constant | Fast | Binary data, ASCII text
each_line | Constant | Variable | Text file processing
each_codepoint | Constant | Moderate | Unicode analysis
chars | O(n) | Moderate | Array operations on characters
bytes | O(n) | Fast | Array operations on bytes
lines | O(n) | Variable | Array operations on lines
codepoints | O(n) | Moderate | Array operations on codepoints

Encoding Considerations

Encoding | Character Iteration | Characters:Bytes Ratio | Notes
--- | --- | --- | ---
ASCII | Fast | 1:1 | Single byte per character
UTF-8 | Moderate | 1:1 to 1:4 | Variable width encoding
UTF-16 | Slower | 1:2 or 1:4 | Variable width (2 or 4 bytes per character)
UTF-32 | Slower | 1:4 | Fixed width encoding
Binary | N/A | 1:1 | Use byte iteration only
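
The ratios in the table can be checked directly by encoding a few sample characters and comparing byte sizes. The sketch below uses the explicit little-endian encoding names so that no byte order mark is added; the sample characters are arbitrary.

```ruby
samples = ["a", "é", "中", "🚀"]
encodings = %w[UTF-8 UTF-16LE UTF-32LE]

# Bytes needed for each sample character in each encoding
sizes = samples.map { |char|
  [char, encodings.map { |name| char.encode(name).bytesize }]
}.to_h

sizes.each { |char, bytes| puts "#{char}: #{bytes.join(' / ')}" }
# a: 1 / 2 / 4
# é: 2 / 2 / 4
# 中: 3 / 2 / 4
# 🚀: 4 / 4 / 4
```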

Common Error Types

Error | Cause | Solution
--- | --- | ---
ArgumentError | Invalid byte sequence | Check encoding, use each_byte for binary
Encoding::CompatibilityError | Mixing incompatible encodings | Convert encodings before processing
NoMethodError | Calling an Array-only method on an Enumerator | Call to_a first, or use chars/bytes/lines/codepoints
FrozenError | Modifying frozen string during iteration | Create new string or duplicate before modification

Line Ending Detection

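def detect_line_endings(text)
  has_crlf = text.include?("\r\n")
  # Remove CRLF pairs so that only lone "\r" or "\n" characters remain
  rest = text.gsub("\r\n", "")
  has_cr = rest.include?("\r")
  has_lf = rest.include?("\n")

  return :mixed if [has_crlf, has_cr, has_lf].count(true) > 1
  return :windows if has_crlf
  return :mac if has_cr
  return :unix if has_lf
  :none
end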

Unicode Categories

# Common Unicode character categories for filtering
def categorize_chars(text)
  text.each_char.group_by do |char|
    case char
    when /\p{L}/     then :letter
    when /\p{N}/     then :number  
    when /\p{P}/     then :punctuation
    when /\p{S}/     then :symbol
    when /\p{Z}/     then :separator
    when /\p{C}/     then :control
    else                  :other
    end
  end
end