
CrackedRuby

Unicode Support

A comprehensive guide to handling Unicode text encoding, conversion, and character operations in Ruby.

Standard Library Internationalization
4.13.3

Overview

Ruby provides Unicode support through its string encoding system, which handles character data in various encodings including UTF-8, UTF-16, ASCII, and dozens of other character sets. The core of Ruby's Unicode support centers on two main classes: String and Encoding.

Every string in Ruby has an associated encoding that determines how its bytes are interpreted as characters. Ruby distinguishes between a string's actual byte content and its declared encoding, allowing for explicit control over text processing operations.

# String with UTF-8 encoding
text = "Hello, 世界"
text.encoding
# => #<Encoding:UTF-8>

# Check character vs byte count
text.length     # => 9 characters
text.bytesize   # => 13 bytes

The String#encode method performs encoding conversion, transforming text from one character set to another. This method handles the complex process of mapping characters between different Unicode representations and legacy encodings.

# Convert UTF-8 to UTF-16
utf16_text = "Café".encode("UTF-16LE")
utf16_text.bytes
# => [67, 0, 97, 0, 102, 0, 233, 0]

Ruby's encoding system supports over 100 different character encodings, accessible through the Encoding class. Each encoding defines how bytes map to characters and handles different ranges of the Unicode code point space.

# List available encodings
Encoding.list.size
# => 101 (exact count varies by Ruby version)

# Get specific encoding
Encoding::UTF_8
# => #<Encoding:UTF-8>

The String#force_encoding method changes a string's encoding designation without converting bytes, which differs from String#encode that actually transforms the byte sequence. This distinction is critical for handling binary data or fixing encoding mismatches.
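
The distinction shows up directly in validity checks; a minimal sketch:

```ruby
# encode transcodes the bytes; force_encoding only relabels them
latin1 = "café".encode("ISO-8859-1")   # é stored as the single byte 0xE9

transcoded = latin1.encode("UTF-8")              # bytes change: é becomes 0xC3 0xA9
relabeled  = latin1.dup.force_encoding("UTF-8")  # bytes unchanged, label changed

transcoded.valid_encoding?   # => true
relabeled.valid_encoding?    # => false (a lone 0xE9 is not valid UTF-8)
```

Use `encode` when the text must actually change representation; reserve `force_encoding` for correcting a wrong label on bytes that are already in the target encoding.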

Basic Usage

String creation automatically assigns encoding based on source file encoding or runtime default. Ruby defaults to UTF-8 for string literals in most environments.

# Default encoding assignment
default_string = "Hello"
default_string.encoding
# => #<Encoding:UTF-8>

# Explicit encoding via magic comment (must appear on the first line of the file)
# -*- coding: iso-8859-1 -*-
latin1_string = "café"
latin1_string.encoding
# => #<Encoding:ISO-8859-1>

The String#encode method converts between encodings, handling character mapping and byte sequence transformation. Options control behavior for characters that cannot be represented in the target encoding.

# Basic encoding conversion
original = "Héllo wørld"
ascii_version = original.encode("ASCII", invalid: :replace, undef: :replace)
# => "H?llo w?rld"

# Conversion with custom replacement
original.encode("ASCII", invalid: :replace, undef: :replace, replace: "_")
# => "H_llo w_rld"

Reading files requires careful encoding handling. Unless an encoding is specified, Ruby applies Encoding.default_external; a UTF-8 byte order mark is honored only when the mode string requests it explicitly (for example "r:bom|utf-8").

# Reading with explicit encoding
content = File.read("data.txt", encoding: "UTF-8")

# Reading binary then forcing encoding
binary_content = File.read("data.txt", mode: "rb")
text_content = binary_content.force_encoding("UTF-8")

String manipulation methods behave differently depending on encoding. Character-based operations count Unicode code points, while byte-based operations work with raw byte sequences.

multilingual = "English日本語Русский"

# Character operations
multilingual.length        # => 17 characters
multilingual[7, 3]         # => "日本語"
multilingual.upcase        # => "ENGLISH日本語РУССКИЙ"

# Byte operations
multilingual.bytesize      # => 30 bytes
multilingual.bytes.size    # => 30

Regular expressions respect string encoding and match Unicode categories. The encoding of pattern and target string must be compatible.

text = "Price: $123.45"
# Match Unicode currency symbols
currency_pattern = /\p{Currency_Symbol}/u
text.scan(currency_pattern)
# => ["$"]

# Match Unicode word characters including non-ASCII
words = "Hello 世界 test".scan(/\p{Word}+/)
# => ["Hello", "世界", "test"]

String comparison operations consider encoding compatibility. Strings with different encodings but identical character content may not compare as equal without conversion.

utf8_text = "café".encode("UTF-8")
latin1_text = "café".encode("ISO-8859-1")

# Direct comparison fails
utf8_text == latin1_text
# => false

# Convert for comparison
utf8_text == latin1_text.encode("UTF-8")
# => true

Error Handling & Debugging

Ruby raises specific exceptions when encoding operations encounter problems. Encoding::InvalidByteSequenceError occurs when byte sequences don't form valid characters in the specified encoding.

# Invalid UTF-8 byte sequence
invalid_bytes = "\xFF\xFE".force_encoding("UTF-8")

begin
  invalid_bytes.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.dump}"
  puts "Source encoding: #{e.source_encoding}"
end

Encoding::UndefinedConversionError appears when characters exist in source encoding but cannot be represented in target encoding.

# Unicode character not in ASCII
unicode_text = "Café éñ español"

begin
  ascii_result = unicode_text.encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Cannot convert: #{e.error_char} (U+#{e.error_char.ord.to_s(16).upcase})"
  puts "From: #{e.source_encoding} to #{e.destination_encoding}"
  
  # Retry with replacement
  ascii_result = unicode_text.encode("ASCII", undef: :replace, replace: "?")
end

The String#valid_encoding? method checks whether byte content forms valid characters according to the string's declared encoding.

# Valid UTF-8
valid_text = "Hello 世界"
valid_text.valid_encoding?
# => true

# Invalid UTF-8 bytes
invalid_utf8 = "\x80\x81".force_encoding("UTF-8")  
invalid_utf8.valid_encoding?
# => false

# Same bytes valid in different encoding
valid_binary = invalid_utf8.force_encoding("ASCII-8BIT")
valid_binary.valid_encoding?
# => true

Debugging encoding issues requires examining both declared encoding and actual byte content. The String#bytes method reveals raw byte values while String#codepoints shows Unicode code points.

problematic_text = "café"

puts "Encoding: #{problematic_text.encoding}"
puts "Bytes: #{problematic_text.bytes.map { |b| sprintf("%02X", b) }}"
puts "Codepoints: #{problematic_text.codepoints.map { |c| sprintf("U+%04X", c) }}"

# Detect common encoding problems
def diagnose_encoding(str)
  if str.encoding.name == "ASCII-8BIT" && str.bytes.any? { |b| b > 127 }
    "Likely UTF-8 data forced to binary encoding"
  elsif !str.valid_encoding?
    "Invalid byte sequence for declared encoding"
  elsif str.encoding.name == "UTF-8" && str.bytes.first(3) == [0xEF, 0xBB, 0xBF]
    "Contains UTF-8 BOM"
  else
    "Encoding appears correct"
  end
end

Transcoding errors can be handled with fallback strategies. The :replace option substitutes problematic characters; String#encode accepts only :replace (not :ignore) for :invalid and :undef, so to drop characters entirely, combine :replace with an empty replacement string.

# Multiple error handling strategies
source_text = "Résumé with emoji 🚀 and symbols ™®"

strategies = {
  strict: {},  # no options - raises on any problem character
  drop:   { invalid: :replace, undef: :replace, replace: "" },
  mark:   { invalid: :replace, undef: :replace, replace: "?" }
}

strategies.each do |name, options|
  begin
    result = source_text.encode("ASCII", **options)
    puts "#{name}: '#{result}'"
  rescue => e
    puts "#{name}: Error - #{e.class}"
  end
end

Common Pitfalls

ASCII-8BIT encoding creates frequent confusion because Ruby treats it as binary data rather than character data. String operations may produce unexpected results when ASCII-8BIT strings contain high-bit bytes.

# Binary string with UTF-8 bytes
binary_data = File.read("utf8_file.txt", mode: "rb")  # ASCII-8BIT encoding
binary_data.encoding
# => #<Encoding:ASCII-8BIT>

# Character operations don't work as expected
binary_data.upcase    # Upcases ASCII letters only; multibyte characters untouched
binary_data.length    # Counts bytes, not characters

# Correct approach: force proper encoding first
text_data = binary_data.force_encoding("UTF-8")
if text_data.valid_encoding?
  text_data.upcase    # Now works correctly
end

Byte Order Marks (BOM) cause problems when present in UTF-8 strings. Many systems add BOM to UTF-8 files, but Ruby treats BOM bytes as regular characters.

# String with UTF-8 BOM
bom_string = "\uFEFF" + "Hello world"
bom_string.length
# => 12 (includes BOM character)

# BOM interferes with string operations
bom_string.start_with?("Hello")
# => false

# Remove BOM for processing
clean_string = bom_string.delete_prefix("\uFEFF")
clean_string.start_with?("Hello")  
# => true

# Check for BOM presence
def has_utf8_bom?(str)
  str.encoding == Encoding::UTF_8 && str.start_with?("\uFEFF")
end

Mixing strings with different encodings in operations like concatenation or comparison produces encoding compatibility errors.

utf8_string = "Hello".encode("UTF-8")
ascii_string = "World".encode("US-ASCII")

# This works - ASCII is subset of UTF-8
combined = utf8_string + " " + ascii_string
combined.encoding
# => #<Encoding:UTF-8>

# This fails with incompatible encodings (the binary data has non-ASCII bytes)
binary_string = "\xFF\xFEdata".force_encoding("ASCII-8BIT")
begin
  result = utf8_string + binary_string
rescue Encoding::CompatibilityError => e
  puts "Cannot mix: #{e.message}"
  # Relabeling allows concatenation, but the result may not be valid UTF-8
  forced = binary_string.force_encoding("UTF-8")
  result = utf8_string + forced if forced.valid_encoding?
end

Regular expressions inherit encoding from their source, which must be compatible with target strings. Pattern encoding mismatches cause runtime errors.

# UTF-8 pattern
pattern = /café/
utf8_text = "I love café au lait"

# Works fine
matches = utf8_text.scan(pattern)

# Fails with different encoding
latin1_text = utf8_text.encode("ISO-8859-1")
begin
  latin1_matches = latin1_text.scan(pattern)
rescue Encoding::CompatibilityError
  # Convert pattern to match text encoding  
  latin1_pattern = Regexp.new(pattern.source.encode("ISO-8859-1"))
  latin1_matches = latin1_text.scan(latin1_pattern)
end

Default external encoding affects file I/O and can cause data corruption if mismatched with actual file encoding. Always specify encoding explicitly when file encoding is known.

# Dangerous - relies on default encoding
content = File.read("japanese.txt")  # May corrupt data

# Safe - explicit encoding specification  
content = File.read("japanese.txt", encoding: "UTF-8")

# Check default encodings
puts "External: #{Encoding.default_external}"
puts "Internal: #{Encoding.default_internal}"

# Overriding defaults affects the entire process - prefer per-file options
Encoding.default_external = "UTF-8"
Encoding.default_internal = "UTF-8"

Character normalization differences cause identical-looking strings to compare as unequal. Unicode allows multiple byte representations for the same visual characters.

# Two ways to represent é
composed = "café"      # é as single character (U+00E9)
decomposed = "cafe\u0301"  # e + combining acute (U+0065 U+0301)

# Look identical but compare as different
composed == decomposed
# => false

composed.length        # => 4
decomposed.length      # => 5

# Ruby 2.2+ ships String#unicode_normalize for exactly this
composed == decomposed.unicode_normalize(:nfc)
# => true

composed.unicode_normalize(:nfd) == decomposed
# => true

Performance & Memory

UTF-8 provides optimal performance for most text processing in Ruby because it is the default encoding for source files, string literals, and most I/O. Operations on UTF-8 strings avoid the conversion overhead that other encodings incur.

require 'benchmark'

text_samples = {
  utf8: "Hello 世界 testing performance with UTF-8",
  utf16: "Hello 世界 testing performance with UTF-16".encode("UTF-16LE"),
  ascii: "Hello world testing performance with ASCII".encode("US-ASCII")
}

# Benchmark string operations
Benchmark.bmbm do |x|
  text_samples.each do |encoding, sample|
    x.report("#{encoding}_upcase") { 10_000.times { sample.upcase } }
    # /\w+/ is an ASCII-compatible pattern; UTF-16LE is not ASCII-compatible,
    # so scanning it would raise Encoding::CompatibilityError
    next unless sample.encoding.ascii_compatible?
    x.report("#{encoding}_scan") { 10_000.times { sample.scan(/\w+/) } }
  end
end

Memory usage varies significantly between encodings. UTF-32 uses 4 bytes per character regardless of character complexity, while UTF-8 uses 1-4 bytes per character based on Unicode code point range.

# Memory comparison for same text
base_text = "Mixed: ASCII + 中文 + العربية + 🌟"

encodings = ["UTF-8", "UTF-16LE", "UTF-32LE", "ASCII-8BIT"]
memory_usage = {}

encodings.each do |enc|
  encoded = base_text.encode(enc) rescue next
  memory_usage[enc] = {
    characters: encoded.length,
    bytes: encoded.bytesize,
    ratio: encoded.bytesize.to_f / encoded.length
  }
end

memory_usage.each do |enc, stats|
  puts "#{enc}: #{stats[:bytes]} bytes for #{stats[:characters]} chars (#{stats[:ratio].round(1)} bytes/char)"
end

Large text processing benefits from streaming approaches that avoid loading entire files into memory. Process text in chunks while maintaining encoding boundaries.

def process_large_file(filename, chunk_size = 8192)
  File.open(filename, "rb") do |file|
    buffer = "".b  # accumulate raw bytes; IO#read(length) always returns binary

    while chunk = file.read(chunk_size)
      buffer << chunk

      # Split on complete lines so multi-byte characters are never
      # cut at a chunk boundary
      lines = buffer.split("\n", -1)
      buffer = lines.pop || "".b  # keep the incomplete trailing line

      lines.each do |line|
        yield line.force_encoding("UTF-8") if block_given?
      end
    end

    # Process whatever remains after the final chunk
    yield buffer.force_encoding("UTF-8") unless buffer.empty?
  end
end

# Usage with large file
process_large_file("large_unicode.txt") do |line|
  # Process line without loading entire file
  processed = line.upcase.gsub(/\s+/, " ")
end

Encoding conversion performance varies by source and target encodings. UTF-8 to UTF-16 conversion is faster than complex legacy encoding transformations.

# Performance comparison for different conversions
test_text = File.read("sample.txt", encoding: "UTF-8")

conversions = [
  ["UTF-8", "UTF-16LE"],
  ["UTF-8", "ISO-8859-1"],
  ["UTF-8", "Shift_JIS"],
  ["UTF-8", "UTF-32LE"]
]

Benchmark.bmbm do |x|
  conversions.each do |from, to|
    x.report("#{from}_to_#{to}") do
      1000.times { test_text.encode(to, invalid: :replace, undef: :replace) }
    end
  end
end

String concatenation does not transcode automatically: combining incompatible encodings raises Encoding::CompatibilityError. Convert all strings to a single encoding once, up front, rather than encoding inside a hot loop.

# Broken - concatenation never converts automatically, it raises
mixed_strings = ["Hello", "世界".encode("UTF-16LE"), "test"]
result = ""
begin
  mixed_strings.each { |s| result += s }
rescue Encoding::CompatibilityError
  # UTF-16LE cannot be appended to an ASCII-compatible string
end

# Correct and efficient - convert once, outside the loop
utf8_strings = mixed_strings.map { |s| s.encode("UTF-8") }
result = ""
1000.times do
  utf8_strings.each { |s| result += s }  # no conversion, no error
end

Reference

Core Classes and Methods

| Method | Parameters | Returns | Description |
|--------|------------|---------|-------------|
| String#encoding | None | Encoding | Returns string's current encoding |
| String#encode(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Convert string to different encoding |
| String#encode!(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | In-place encoding conversion |
| String#force_encoding(encoding) | encoding (String/Encoding) | String | Change encoding without converting bytes |
| String#valid_encoding? | None | Boolean | Check if bytes are valid for encoding |
| String#ascii_only? | None | Boolean | True if all characters are ASCII |
| String#bytes | None | Array<Integer> | Array of byte values |
| String#codepoints | None | Array<Integer> | Array of Unicode code points |
| String#bytesize | None | Integer | Number of bytes in string |
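
A quick tour of several of these methods on one string (values shown assume a UTF-8 source file):

```ruby
s = "naïve"

s.ascii_only?   # => false
s.bytesize      # => 6 (ï is two bytes in UTF-8)
s.codepoints    # => [110, 97, 239, 118, 101]

t = "naïve".dup
t.encode!("ISO-8859-1")  # in-place conversion; returns the receiver
t.encoding               # => #<Encoding:ISO-8859-1>
t.bytesize               # => 5 (ï is one byte in Latin-1)
```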

Encoding Class Methods

| Method | Parameters | Returns | Description |
|--------|------------|---------|-------------|
| Encoding.list | None | Array<Encoding> | All available encodings |
| Encoding.find(name) | name (String) | Encoding | Find encoding by name |
| Encoding.compatible?(str1, str2) | str1, str2 (String) | Encoding or nil | Common encoding or nil |
| Encoding.default_external | None | Encoding | Default for file I/O |
| Encoding.default_internal | None | Encoding or nil | Default conversion target |
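
Two of these class methods in action:

```ruby
# Encoding.find resolves names and aliases, case-insensitively
Encoding.find("utf-8")    # => #<Encoding:UTF-8>
Encoding.find("BINARY")   # => #<Encoding:ASCII-8BIT> (alias)

# Encoding.compatible? predicts whether two strings can be combined
Encoding.compatible?("abc", "世界")     # => #<Encoding:UTF-8>
Encoding.compatible?("世界", "\xFF".b)  # => nil (concatenation would raise)
```

Checking Encoding.compatible? before concatenating mixed-origin strings avoids rescuing Encoding::CompatibilityError after the fact.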

Encoding Conversion Options

| Option | Values | Description |
|--------|--------|-------------|
| :invalid | :replace | Replace invalid byte sequences (only :replace is accepted) |
| :undef | :replace | Replace characters undefined in the target encoding (only :replace is accepted) |
| :replace | String | Replacement string (defaults to U+FFFD for Unicode targets, "?" otherwise) |
| :fallback | Hash or Proc | Custom character mapping for unconvertible characters |
| :xml | :text, :attr | XML-specific escaping rules |
| :cr_newline | Boolean | Convert LF to CR |
| :crlf_newline | Boolean | Convert LF to CRLF |
| :universal_newline | Boolean | Convert CRLF and CR to LF |
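
The less common :xml and :fallback options in practice; a brief sketch:

```ruby
# :xml escapes markup-significant characters during conversion
"a < b & c".encode("UTF-8", xml: :text)
# => "a &lt; b &amp; c"

# :attr additionally escapes '"' and wraps the result in double quotes
attr = "say \"hi\"".encode("UTF-8", xml: :attr)

# :fallback supplies a custom mapping for unconvertible characters
"héllo".encode("ASCII", fallback: ->(char) { "[U+%04X]" % char.ord })
# => "h[U+00E9]llo"
```

A :fallback proc receives each unconvertible character and must return a replacement string valid in the target encoding.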

Common Encodings

| Encoding | Description | Bytes per Character | Use Case |
|----------|-------------|---------------------|----------|
| UTF-8 | Unicode 8-bit | 1-4 | Default, web content, modern files |
| UTF-16LE | Unicode 16-bit little endian | 2-4 | Windows systems, some databases |
| UTF-32LE | Unicode 32-bit little endian | 4 | Fixed-width Unicode processing |
| US-ASCII | 7-bit ASCII | 1 | Legacy systems, protocol headers |
| ASCII-8BIT | Binary data | 1 | Binary files, network protocols |
| ISO-8859-1 | Latin-1 | 1 | Legacy European text |
| Windows-1252 | Windows Latin | 1 | Windows legacy files |
| Shift_JIS | Japanese | 1-2 | Japanese legacy systems |

Exception Classes

| Exception | Trigger | Common Cause |
|-----------|---------|--------------|
| Encoding::InvalidByteSequenceError | Invalid bytes for encoding | Corrupted data, wrong encoding |
| Encoding::UndefinedConversionError | Character not in target encoding | Unicode to ASCII conversion |
| Encoding::CompatibilityError | Incompatible encodings | Mixed encoding operations |
| Encoding::ConverterNotFoundError | Unknown encoding conversion | Typo in encoding name |
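
Each exception can be triggered deliberately; a small helper (the `raised_class` name is illustrative) makes the mapping explicit:

```ruby
# Returns the class of the exception a block raises, or nil
def raised_class
  yield
  nil
rescue => e
  e.class
end

raised_class { "\xFF".force_encoding("UTF-8").encode("UTF-16LE") }
# => Encoding::InvalidByteSequenceError
raised_class { "日本".encode("US-ASCII") }
# => Encoding::UndefinedConversionError
raised_class { "日本" + "\xFF".b }
# => Encoding::CompatibilityError
raised_class { "text".encode("NO-SUCH-ENCODING") }
# => Encoding::ConverterNotFoundError
```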

Regular Expression Encoding Flags

| Flag | Description |
|------|-------------|
| //u | UTF-8 encoding |
| //e | EUC-JP encoding |
| //s | Windows-31J encoding |
| //n | ASCII-8BIT encoding |
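
The //n flag is the most commonly needed one, since it lets a pattern match raw bytes in binary data:

```ruby
# //n forces ASCII-8BIT so the pattern can match individual bytes
data = "\xFF\x00\xFFpayload".b

data.scan(/\xFF/n).size   # => 2
/\xFF/n.encoding          # => #<Encoding:ASCII-8BIT>

# Without a flag, a literal containing only 7-bit characters is US-ASCII
/abc/.encoding            # => #<Encoding:US-ASCII>
```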

File I/O Encoding Options

# Reading with encoding
File.read("file.txt", encoding: "UTF-8")
File.read("file.txt", encoding: "UTF-8:UTF-16LE")  # external:internal

# Writing with encoding  
File.write("file.txt", data, encoding: "UTF-8")

# Open with encoding
File.open("file.txt", "r:UTF-8:UTF-16LE") do |f|
  # Read UTF-8, convert to UTF-16LE
end