
CrackedRuby

Unicode Support

A comprehensive guide to handling Unicode text encoding, conversion, and character operations in Ruby.

Standard Library Internationalization
4.13.3

Overview

Ruby provides Unicode support through its string encoding system, which handles character data in various encodings including UTF-8, UTF-16, ASCII, and dozens of other character sets. The core of Ruby's Unicode support centers on two main classes: String and Encoding.

Every string in Ruby has an associated encoding that determines how its bytes are interpreted as characters. Ruby distinguishes between a string's actual byte content and its declared encoding, allowing for explicit control over text processing operations.

# String with UTF-8 encoding
text = "Hello, 世界"
text.encoding
# => #<Encoding:UTF-8>

# Check character vs byte count
text.length     # => 9 characters
text.bytesize   # => 13 bytes

The String#encode method performs encoding conversion, transforming text from one character set to another. This method handles the complex process of mapping characters between different Unicode representations and legacy encodings.

# Convert UTF-8 to UTF-16
utf16_text = "Café".encode("UTF-16LE")
utf16_text.bytes
# => [67, 0, 97, 0, 102, 0, 233, 0]

Ruby's encoding system supports over 100 different character encodings, accessible through the Encoding class. Each encoding defines how bytes map to characters and handles different ranges of the Unicode code point space.

# List available encodings
Encoding.list.size
# => 101 (exact count varies by Ruby version)

# Get specific encoding
Encoding::UTF_8
# => #<Encoding:UTF-8>

The String#force_encoding method changes a string's encoding designation without converting bytes, which differs from String#encode that actually transforms the byte sequence. This distinction is critical for handling binary data or fixing encoding mismatches.
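
The distinction shows up directly in validity checks; a minimal sketch:

```ruby
# encode transcodes the bytes; force_encoding only relabels them
latin1 = "café".encode("ISO-8859-1")   # é stored as the single byte 0xE9

transcoded = latin1.encode("UTF-8")              # bytes change: é becomes 0xC3 0xA9
relabeled  = latin1.dup.force_encoding("UTF-8")  # bytes unchanged, label changed

transcoded.valid_encoding?   # => true
relabeled.valid_encoding?    # => false (a lone 0xE9 is not valid UTF-8)
```

Use `encode` when the text must actually change representation; reserve `force_encoding` for correcting a wrong label on bytes that are already in the target encoding.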

Basic Usage

String creation automatically assigns encoding based on source file encoding or runtime default. Ruby defaults to UTF-8 for string literals in most environments.

# Default encoding assignment
default_string = "Hello"
default_string.encoding
# => #<Encoding:UTF-8>

# Explicit encoding via magic comment (must appear on the first line of the file)
# -*- coding: iso-8859-1 -*-
latin1_string = "café"
latin1_string.encoding
# => #<Encoding:ISO-8859-1>

The String#encode method converts between encodings, handling character mapping and byte sequence transformation. Options control behavior for characters that cannot be represented in the target encoding.

# Basic encoding conversion
original = "Héllo wørld"
ascii_version = original.encode("ASCII", invalid: :replace, undef: :replace)
# => "H?llo w?rld"

# Conversion with custom replacement
original.encode("ASCII", invalid: :replace, undef: :replace, replace: "_")
# => "H_llo w_rld"

Reading files requires careful encoding handling. Unless an encoding is specified, Ruby applies Encoding.default_external; a UTF-8 byte order mark is honored only when the mode string requests it explicitly (for example "r:bom|utf-8").

# Reading with explicit encoding
content = File.read("data.txt", encoding: "UTF-8")

# Reading binary then forcing encoding
binary_content = File.read("data.txt", mode: "rb")
text_content = binary_content.force_encoding("UTF-8")

String manipulation methods behave differently depending on encoding. Character-based operations count Unicode code points, while byte-based operations work with raw byte sequences.

multilingual = "English日本語Русский"

# Character operations
multilingual.length        # => 17 characters
multilingual[7, 3]         # => "日本語"
multilingual.upcase        # => "ENGLISH日本語РУССКИЙ"

# Byte operations
multilingual.bytesize      # => 30 bytes
multilingual.bytes.size    # => 30

Regular expressions respect string encoding and match Unicode categories. The encoding of pattern and target string must be compatible.

text = "Price: $123.45"
# Match Unicode currency symbols
currency_pattern = /\p{Currency_Symbol}/u
text.scan(currency_pattern)
# => ["$"]

# Match Unicode word characters including non-ASCII
words = "Hello 世界 test".scan(/\p{Word}+/)
# => ["Hello", "世界", "test"]

String comparison operations consider encoding compatibility. Strings with different encodings but identical character content may not compare as equal without conversion.

utf8_text = "café".encode("UTF-8")
latin1_text = "café".encode("ISO-8859-1")

# Direct comparison fails
utf8_text == latin1_text
# => false

# Convert for comparison
utf8_text == latin1_text.encode("UTF-8")
# => true

Error Handling & Debugging

Ruby raises specific exceptions when encoding operations encounter problems. Encoding::InvalidByteSequenceError occurs when byte sequences don't form valid characters in the specified encoding.

# Invalid UTF-8 byte sequence
invalid_bytes = "\xFF\xFE".force_encoding("UTF-8")

begin
  invalid_bytes.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.dump}"
  puts "Source encoding: #{e.source_encoding}"
end

Encoding::UndefinedConversionError appears when characters exist in source encoding but cannot be represented in target encoding.

# Unicode character not in ASCII
unicode_text = "Café éñ español"

begin
  ascii_result = unicode_text.encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Cannot convert: #{e.error_char} (U+#{e.error_char.ord.to_s(16).upcase})"
  puts "From: #{e.source_encoding} to #{e.destination_encoding}"
  
  # Retry with replacement
  ascii_result = unicode_text.encode("ASCII", undef: :replace, replace: "?")
end

The String#valid_encoding? method checks whether byte content forms valid characters according to the string's declared encoding.

# Valid UTF-8
valid_text = "Hello 世界"
valid_text.valid_encoding?
# => true

# Invalid UTF-8 bytes
invalid_utf8 = "\x80\x81".force_encoding("UTF-8")  
invalid_utf8.valid_encoding?
# => false

# Same bytes valid in different encoding
valid_binary = invalid_utf8.force_encoding("ASCII-8BIT")
valid_binary.valid_encoding?
# => true

Debugging encoding issues requires examining both declared encoding and actual byte content. The String#bytes method reveals raw byte values while String#codepoints shows Unicode code points.

problematic_text = "café"

puts "Encoding: #{problematic_text.encoding}"
puts "Bytes: #{problematic_text.bytes.map { |b| sprintf("%02X", b) }}"
puts "Codepoints: #{problematic_text.codepoints.map { |c| sprintf("U+%04X", c) }}"

# Detect common encoding problems
def diagnose_encoding(str)
  if str.encoding.name == "ASCII-8BIT" && str.bytes.any? { |b| b > 127 }
    "Likely UTF-8 data forced to binary encoding"
  elsif !str.valid_encoding?
    "Invalid byte sequence for declared encoding"
  elsif str.encoding.name == "UTF-8" && str.bytes.first(3) == [0xEF, 0xBB, 0xBF]
    "Contains UTF-8 BOM"
  else
    "Encoding appears correct"
  end
end

Transcoding errors can be handled with fallback strategies. The :replace option substitutes problematic characters; String#encode accepts only :replace (not :ignore) for :invalid and :undef, so to drop characters entirely, combine :replace with an empty replacement string.

# Multiple error handling strategies
source_text = "Résumé with emoji 🚀 and symbols ™®"

strategies = {
  strict: {},  # no options - raises on any problem character
  drop:   { invalid: :replace, undef: :replace, replace: "" },
  mark:   { invalid: :replace, undef: :replace, replace: "?" }
}

strategies.each do |name, options|
  begin
    result = source_text.encode("ASCII", **options)
    puts "#{name}: '#{result}'"
  rescue => e
    puts "#{name}: Error - #{e.class}"
  end
end

Common Pitfalls

ASCII-8BIT encoding creates frequent confusion because Ruby treats it as binary data rather than character data. String operations may produce unexpected results when ASCII-8BIT strings contain high-bit bytes.

# Binary string with UTF-8 bytes
binary_data = File.read("utf8_file.txt", mode: "rb")  # ASCII-8BIT encoding
binary_data.encoding
# => #<Encoding:ASCII-8BIT>

# Character operations don't work as expected
binary_data.upcase    # Upcases ASCII letters only; multibyte characters untouched
binary_data.length    # Counts bytes, not characters

# Correct approach: force proper encoding first
text_data = binary_data.force_encoding("UTF-8")
if text_data.valid_encoding?
  text_data.upcase    # Now works correctly
end

Byte Order Marks (BOM) cause problems when present in UTF-8 strings. Many systems add BOM to UTF-8 files, but Ruby treats BOM bytes as regular characters.

# String with UTF-8 BOM
bom_string = "\uFEFF" + "Hello world"
bom_string.length
# => 12 (includes BOM character)

# BOM interferes with string operations
bom_string.start_with?("Hello")
# => false

# Remove BOM for processing
clean_string = bom_string.delete_prefix("\uFEFF")
clean_string.start_with?("Hello")  
# => true

# Check for BOM presence
def has_utf8_bom?(str)
  str.encoding == Encoding::UTF_8 && str.start_with?("\uFEFF")
end

Mixing strings with different encodings in operations like concatenation or comparison produces encoding compatibility errors.

utf8_string = "Hello".encode("UTF-8")
ascii_string = "World".encode("US-ASCII")

# This works - ASCII is subset of UTF-8
combined = utf8_string + " " + ascii_string
combined.encoding
# => #<Encoding:UTF-8>

# This fails with incompatible encodings (the binary data has non-ASCII bytes)
binary_string = "\xFF\xFEdata".force_encoding("ASCII-8BIT")
begin
  result = utf8_string + binary_string
rescue Encoding::CompatibilityError => e
  puts "Cannot mix: #{e.message}"
  # Relabeling allows concatenation, but the result may not be valid UTF-8
  forced = binary_string.force_encoding("UTF-8")
  result = utf8_string + forced if forced.valid_encoding?
end

Regular expressions inherit encoding from their source, which must be compatible with target strings. Pattern encoding mismatches cause runtime errors.

# UTF-8 pattern
pattern = /café/
utf8_text = "I love café au lait"

# Works fine
matches = utf8_text.scan(pattern)

# Fails with different encoding
latin1_text = utf8_text.encode("ISO-8859-1")
begin
  latin1_matches = latin1_text.scan(pattern)
rescue Encoding::CompatibilityError
  # Convert pattern to match text encoding  
  latin1_pattern = Regexp.new(pattern.source.encode("ISO-8859-1"))
  latin1_matches = latin1_text.scan(latin1_pattern)
end

Default external encoding affects file I/O and can cause data corruption if mismatched with actual file encoding. Always specify encoding explicitly when file encoding is known.

# Dangerous - relies on default encoding
content = File.read("japanese.txt")  # May corrupt data

# Safe - explicit encoding specification  
content = File.read("japanese.txt", encoding: "UTF-8")

# Check default encodings
puts "External: #{Encoding.default_external}"
puts "Internal: #{Encoding.default_internal}"

# Overriding defaults affects the entire process - prefer per-file options
Encoding.default_external = "UTF-8"
Encoding.default_internal = "UTF-8"

Character normalization differences cause identical-looking strings to compare as unequal. Unicode allows multiple byte representations for the same visual characters.

# Two ways to represent é
composed = "café"      # é as single character (U+00E9)
decomposed = "cafe\u0301"  # e + combining acute (U+0065 U+0301)

# Look identical but compare as different
composed == decomposed
# => false

composed.length        # => 4
decomposed.length      # => 5

# Ruby 2.2+ ships String#unicode_normalize for exactly this
composed == decomposed.unicode_normalize(:nfc)
# => true

composed.unicode_normalize(:nfd) == decomposed
# => true

Performance & Memory

UTF-8 provides optimal performance for most text processing in Ruby because it is the default encoding for source files, string literals, and most I/O. Operations on UTF-8 strings avoid the conversion overhead that other encodings incur.

require 'benchmark'

text_samples = {
  utf8: "Hello 世界 testing performance with UTF-8",
  utf16: "Hello 世界 testing performance with UTF-16".encode("UTF-16LE"),
  ascii: "Hello world testing performance with ASCII".encode("US-ASCII")
}

# Benchmark string operations
Benchmark.bmbm do |x|
  text_samples.each do |encoding, sample|
    x.report("#{encoding}_upcase") { 10_000.times { sample.upcase } }
    # /\w+/ is an ASCII-compatible pattern; UTF-16LE is not ASCII-compatible,
    # so scanning it would raise Encoding::CompatibilityError
    next unless sample.encoding.ascii_compatible?
    x.report("#{encoding}_scan") { 10_000.times { sample.scan(/\w+/) } }
  end
end

Memory usage varies significantly between encodings. UTF-32 uses 4 bytes per character regardless of character complexity, while UTF-8 uses 1-4 bytes per character based on Unicode code point range.

# Memory comparison for same text
base_text = "Mixed: ASCII + 中文 + العربية + 🌟"

encodings = ["UTF-8", "UTF-16LE", "UTF-32LE", "ASCII-8BIT"]
memory_usage = {}

encodings.each do |enc|
  encoded = base_text.encode(enc) rescue next
  memory_usage[enc] = {
    characters: encoded.length,
    bytes: encoded.bytesize,
    ratio: encoded.bytesize.to_f / encoded.length
  }
end

memory_usage.each do |enc, stats|
  puts "#{enc}: #{stats[:bytes]} bytes for #{stats[:characters]} chars (#{stats[:ratio].round(1)} bytes/char)"
end

Large text processing benefits from streaming approaches that avoid loading entire files into memory. Process text in chunks while maintaining encoding boundaries.

def process_large_file(filename, chunk_size = 8192)
  File.open(filename, "rb") do |file|
    buffer = "".b  # accumulate raw bytes; IO#read(length) always returns binary

    while chunk = file.read(chunk_size)
      buffer << chunk

      # Split on complete lines so multi-byte characters are never
      # cut at a chunk boundary
      lines = buffer.split("\n", -1)
      buffer = lines.pop || "".b  # keep the incomplete trailing line

      lines.each do |line|
        yield line.force_encoding("UTF-8") if block_given?
      end
    end

    # Process whatever remains after the final chunk
    yield buffer.force_encoding("UTF-8") unless buffer.empty?
  end
end

# Usage with large file
process_large_file("large_unicode.txt") do |line|
  # Process line without loading entire file
  processed = line.upcase.gsub(/\s+/, " ")
end

Encoding conversion performance varies by source and target encodings. UTF-8 to UTF-16 conversion is faster than complex legacy encoding transformations.

# Performance comparison for different conversions
test_text = File.read("sample.txt", encoding: "UTF-8")

conversions = [
  ["UTF-8", "UTF-16LE"],
  ["UTF-8", "ISO-8859-1"],
  ["UTF-8", "Shift_JIS"],
  ["UTF-8", "UTF-32LE"]
]

Benchmark.bmbm do |x|
  conversions.each do |from, to|
    x.report("#{from}_to_#{to}") do
      1000.times { test_text.encode(to, invalid: :replace, undef: :replace) }
    end
  end
end

String concatenation does not transcode automatically: combining incompatible encodings raises Encoding::CompatibilityError. Convert all strings to a single encoding once, up front, rather than encoding inside a hot loop.

# Broken - concatenation never converts automatically, it raises
mixed_strings = ["Hello", "世界".encode("UTF-16LE"), "test"]
result = ""
begin
  mixed_strings.each { |s| result += s }
rescue Encoding::CompatibilityError
  # UTF-16LE cannot be appended to an ASCII-compatible string
end

# Correct and efficient - convert once, outside the loop
utf8_strings = mixed_strings.map { |s| s.encode("UTF-8") }
result = ""
1000.times do
  utf8_strings.each { |s| result += s }  # no conversion, no error
end

Reference

Core Classes and Methods

| Method | Parameters | Returns | Description |
|--------|------------|---------|-------------|
| String#encoding | None | Encoding | Returns string's current encoding |
| String#encode(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Convert string to different encoding |
| String#encode!(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | In-place encoding conversion |
| String#force_encoding(encoding) | encoding (String/Encoding) | String | Change encoding without converting bytes |
| String#valid_encoding? | None | Boolean | Check if bytes are valid for encoding |
| String#ascii_only? | None | Boolean | True if all characters are ASCII |
| String#bytes | None | Array<Integer> | Array of byte values |
| String#codepoints | None | Array<Integer> | Array of Unicode code points |
| String#bytesize | None | Integer | Number of bytes in string |
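
A quick tour of several of these methods on one string (values shown assume a UTF-8 source file):

```ruby
s = "naïve"

s.ascii_only?   # => false
s.bytesize      # => 6 (ï is two bytes in UTF-8)
s.codepoints    # => [110, 97, 239, 118, 101]

t = "naïve".dup
t.encode!("ISO-8859-1")  # in-place conversion; returns the receiver
t.encoding               # => #<Encoding:ISO-8859-1>
t.bytesize               # => 5 (ï is one byte in Latin-1)
```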

Encoding Class Methods

| Method | Parameters | Returns | Description |
|--------|------------|---------|-------------|
| Encoding.list | None | Array<Encoding> | All available encodings |
| Encoding.find(name) | name (String) | Encoding | Find encoding by name |
| Encoding.compatible?(str1, str2) | str1, str2 (String) | Encoding or nil | Common encoding or nil |
| Encoding.default_external | None | Encoding | Default for file I/O |
| Encoding.default_internal | None | Encoding or nil | Default conversion target |
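
Two of these class methods in action:

```ruby
# Encoding.find resolves names and aliases, case-insensitively
Encoding.find("utf-8")    # => #<Encoding:UTF-8>
Encoding.find("BINARY")   # => #<Encoding:ASCII-8BIT> (alias)

# Encoding.compatible? predicts whether two strings can be combined
Encoding.compatible?("abc", "世界")     # => #<Encoding:UTF-8>
Encoding.compatible?("世界", "\xFF".b)  # => nil (concatenation would raise)
```

Checking Encoding.compatible? before concatenating mixed-origin strings avoids rescuing Encoding::CompatibilityError after the fact.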

Encoding Conversion Options

| Option | Values | Description |
|--------|--------|-------------|
| :invalid | :replace | Replace invalid byte sequences (only :replace is accepted) |
| :undef | :replace | Replace characters undefined in the target encoding (only :replace is accepted) |
| :replace | String | Replacement string (defaults to U+FFFD for Unicode targets, "?" otherwise) |
| :fallback | Hash or Proc | Custom character mapping for unconvertible characters |
| :xml | :text, :attr | XML-specific escaping rules |
| :cr_newline | Boolean | Convert LF to CR |
| :crlf_newline | Boolean | Convert LF to CRLF |
| :universal_newline | Boolean | Convert CRLF and CR to LF |
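
The less common :xml and :fallback options in practice; a brief sketch:

```ruby
# :xml escapes markup-significant characters during conversion
"a < b & c".encode("UTF-8", xml: :text)
# => "a &lt; b &amp; c"

# :attr additionally escapes '"' and wraps the result in double quotes
attr = "say \"hi\"".encode("UTF-8", xml: :attr)

# :fallback supplies a custom mapping for unconvertible characters
"héllo".encode("ASCII", fallback: ->(char) { "[U+%04X]" % char.ord })
# => "h[U+00E9]llo"
```

A :fallback proc receives each unconvertible character and must return a replacement string valid in the target encoding.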

Common Encodings

| Encoding | Description | Bytes per Character | Use Case |
|----------|-------------|---------------------|----------|
| UTF-8 | Unicode 8-bit | 1-4 | Default, web content, modern files |
| UTF-16LE | Unicode 16-bit little endian | 2-4 | Windows systems, some databases |
| UTF-32LE | Unicode 32-bit little endian | 4 | Fixed-width Unicode processing |
| US-ASCII | 7-bit ASCII | 1 | Legacy systems, protocol headers |
| ASCII-8BIT | Binary data | 1 | Binary files, network protocols |
| ISO-8859-1 | Latin-1 | 1 | Legacy European text |
| Windows-1252 | Windows Latin | 1 | Windows legacy files |
| Shift_JIS | Japanese | 1-2 | Japanese legacy systems |

Exception Classes

| Exception | Trigger | Common Cause |
|-----------|---------|--------------|
| Encoding::InvalidByteSequenceError | Invalid bytes for encoding | Corrupted data, wrong encoding |
| Encoding::UndefinedConversionError | Character not in target encoding | Unicode to ASCII conversion |
| Encoding::CompatibilityError | Incompatible encodings | Mixed encoding operations |
| Encoding::ConverterNotFoundError | Unknown encoding conversion | Typo in encoding name |
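
Each exception can be triggered deliberately; a small helper (the `raised_class` name is illustrative) makes the mapping explicit:

```ruby
# Returns the class of the exception a block raises, or nil
def raised_class
  yield
  nil
rescue => e
  e.class
end

raised_class { "\xFF".force_encoding("UTF-8").encode("UTF-16LE") }
# => Encoding::InvalidByteSequenceError
raised_class { "日本".encode("US-ASCII") }
# => Encoding::UndefinedConversionError
raised_class { "日本" + "\xFF".b }
# => Encoding::CompatibilityError
raised_class { "text".encode("NO-SUCH-ENCODING") }
# => Encoding::ConverterNotFoundError
```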

Regular Expression Encoding Flags

| Flag | Description |
|------|-------------|
| //u | UTF-8 encoding |
| //e | EUC-JP encoding |
| //s | Windows-31J encoding |
| //n | ASCII-8BIT encoding |
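
The //n flag is the most commonly needed one, since it lets a pattern match raw bytes in binary data:

```ruby
# //n forces ASCII-8BIT so the pattern can match individual bytes
data = "\xFF\x00\xFFpayload".b

data.scan(/\xFF/n).size   # => 2
/\xFF/n.encoding          # => #<Encoding:ASCII-8BIT>

# Without a flag, a literal containing only 7-bit characters is US-ASCII
/abc/.encoding            # => #<Encoding:US-ASCII>
```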

File I/O Encoding Options

# Reading with encoding
File.read("file.txt", encoding: "UTF-8")
File.read("file.txt", encoding: "UTF-8:UTF-16LE")  # external:internal

# Writing with encoding  
File.write("file.txt", data, encoding: "UTF-8")

# Open with encoding
File.open("file.txt", "r:UTF-8:UTF-16LE") do |f|
  # Read UTF-8, convert to UTF-16LE
end