Overview
Ruby provides Unicode support through its string encoding system, which handles character data in various encodings including UTF-8, UTF-16, ASCII, and dozens of other character sets. The core of Ruby's Unicode support centers on two main classes: String and Encoding.
Every string in Ruby has an associated encoding that determines how its bytes are interpreted as characters. Ruby distinguishes between a string's actual byte content and its declared encoding, allowing for explicit control over text processing operations.
# String with UTF-8 encoding
text = "Hello, 世界"
text.encoding
# => #<Encoding:UTF-8>
# Check character vs byte count
text.length # => 9 characters
text.bytesize # => 13 bytes
The String#encode method performs encoding conversion, transforming text from one character set to another. This method handles the complex process of mapping characters between different Unicode representations and legacy encodings.
# Convert UTF-8 to UTF-16
utf16_text = "Café".encode("UTF-16LE")
utf16_text.bytes
# => [67, 0, 97, 0, 102, 0, 233, 0]
Ruby's encoding system supports over 100 different character encodings, accessible through the Encoding class. Each encoding defines how bytes map to characters and which range of the Unicode code point space it covers.
# List available encodings
Encoding.list.size
# => 101 (exact count varies by Ruby version)
# Get specific encoding
Encoding::UTF_8
# => #<Encoding:UTF-8>
The String#force_encoding method changes a string's encoding designation without converting bytes, whereas String#encode actually transforms the byte sequence. This distinction is critical for handling binary data or fixing encoding mismatches.
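The contrast is easy to see side by side. A minimal sketch, assuming a UTF-8 source file:

```ruby
utf8 = "café"                       # é stored as two UTF-8 bytes (0xC3 0xA9)
latin1 = utf8.encode("ISO-8859-1")  # bytes converted: é becomes one byte (0xE9)
relabeled = utf8.dup.force_encoding("ISO-8859-1") # same bytes, new label

latin1.bytesize     # => 4
relabeled.bytesize  # => 5 (bytes untouched)
relabeled.length    # => 5 ("caf" plus two mojibake characters)
```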
Basic Usage
String creation automatically assigns encoding based on source file encoding or runtime default. Ruby defaults to UTF-8 for string literals in most environments.
# Default encoding assignment
default_string = "Hello"
default_string.encoding
# => #<Encoding:UTF-8>
# Explicit encoding via magic comment, which must appear on the
# first line of the file (or the second, after a shebang):
# -*- coding: iso-8859-1 -*-
latin1_string = "café"
latin1_string.encoding
# => #<Encoding:ISO-8859-1>
The String#encode method converts between encodings, handling character mapping and byte-sequence transformation. Options control the behavior for characters that cannot be represented in the target encoding.
# Basic encoding conversion
original = "Héllo wørld"
ascii_version = original.encode("ASCII", invalid: :replace, undef: :replace)
# => "H?llo w?rld"
# Conversion with custom replacement
original.encode("ASCII", invalid: :replace, undef: :replace, replace: "_")
# => "H_llo w_rld"
Reading files requires careful encoding handling. Unless an encoding is given, Ruby applies the default external encoding; byte order marks are honored only when a "bom|utf-8" style mode is requested explicitly.
# Reading with explicit encoding
content = File.read("data.txt", encoding: "UTF-8")
# Reading binary then forcing encoding
binary_content = File.read("data.txt", mode: "rb")
text_content = binary_content.force_encoding("UTF-8")
String manipulation methods behave differently depending on encoding. Character-based operations count Unicode code points, while byte-based operations work with raw byte sequences.
multilingual = "English日本語Русский"
# Character operations
multilingual.length # => 17 characters
multilingual[7, 3] # => "日本語"
multilingual.upcase # => "ENGLISH日本語РУССКИЙ"
# Byte operations
multilingual.bytesize # => 30 bytes (7 ASCII + 9 Japanese + 14 Cyrillic)
multilingual.bytes.size # => 30
Regular expressions respect string encoding and match Unicode categories. The encoding of pattern and target string must be compatible.
text = "Price: $123.45"
# Match Unicode currency symbols (general category Sc)
currency_pattern = /\p{Sc}/
text.scan(currency_pattern)
# => ["$"]
# Match Unicode word characters including non-ASCII
words = "Hello 世界 test".scan(/\p{Word}+/)
# => ["Hello", "世界", "test"]
String comparison operations consider encoding compatibility. Strings with different encodings but identical character content may not compare as equal without conversion.
utf8_text = "café".encode("UTF-8")
latin1_text = "café".encode("ISO-8859-1")
# Direct comparison fails
utf8_text == latin1_text
# => false
# Convert for comparison
utf8_text == latin1_text.encode("UTF-8")
# => true
Error Handling & Debugging
Ruby raises specific exceptions when encoding operations encounter problems. Encoding::InvalidByteSequenceError occurs when byte sequences don't form valid characters in the source encoding.
# Invalid UTF-8 byte sequence
invalid_bytes = "\xFF\xFE".force_encoding("UTF-8")
begin
  invalid_bytes.encode("UTF-16LE")
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.dump}"
  puts "Source encoding: #{e.source_encoding}"
end
Encoding::UndefinedConversionError is raised when a character exists in the source encoding but cannot be represented in the target encoding.
# Unicode character not in ASCII
unicode_text = "Café éñ español"
begin
  ascii_result = unicode_text.encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Cannot convert: #{e.error_char} (U+#{e.error_char.ord.to_s(16).upcase})"
  puts "From: #{e.source_encoding} to #{e.destination_encoding}"
  # Retry with replacement
  ascii_result = unicode_text.encode("ASCII", undef: :replace, replace: "?")
end
The String#valid_encoding? method checks whether the byte content forms valid characters according to the string's declared encoding.
# Valid UTF-8
valid_text = "Hello 世界"
valid_text.valid_encoding?
# => true
# Invalid UTF-8 bytes
invalid_utf8 = "\x80\x81".force_encoding("UTF-8")
invalid_utf8.valid_encoding?
# => false
# Same bytes valid in different encoding
valid_binary = invalid_utf8.force_encoding("ASCII-8BIT")
valid_binary.valid_encoding?
# => true
Debugging encoding issues requires examining both the declared encoding and the actual byte content. The String#bytes method reveals raw byte values, while String#codepoints shows Unicode code points.
problematic_text = "café"
puts "Encoding: #{problematic_text.encoding}"
puts "Bytes: #{problematic_text.bytes.map { |b| sprintf("%02X", b) }}"
puts "Codepoints: #{problematic_text.codepoints.map { |c| sprintf("U+%04X", c) }}"
# Detect common encoding problems
def diagnose_encoding(str)
  if str.encoding.name == "ASCII-8BIT" && str.bytes.any? { |b| b > 127 }
    "Likely UTF-8 data forced to binary encoding"
  elsif !str.valid_encoding?
    "Invalid byte sequence for declared encoding"
  elsif str.encoding.name == "UTF-8" && str.bytes[0, 3] == [0xEF, 0xBB, 0xBF]
    "Contains UTF-8 BOM"
  else
    "Encoding appears correct"
  end
end
Transcoding errors can be handled with fallback strategies. The :replace option substitutes problematic characters, while :ignore removes them entirely.
# Multiple error handling strategies
source_text = "Résumé with emoji 🚀 and symbols ™®"
strategies = {
  strict: { invalid: :replace, undef: :replace, replace: "" },
  loose: { invalid: :replace, undef: :replace, replace: "?" },
  ignore: { invalid: :ignore, undef: :ignore }
}
strategies.each do |name, options|
  begin
    result = source_text.encode("ASCII", **options)
    puts "#{name}: '#{result}'"
  rescue => e
    puts "#{name}: Error - #{e.class}"
  end
end
Common Pitfalls
ASCII-8BIT encoding creates frequent confusion because Ruby treats it as binary data rather than character data. String operations may produce unexpected results when ASCII-8BIT strings contain high-bit bytes.
# Binary string with UTF-8 bytes
binary_data = File.read("utf8_file.txt", mode: "rb") # ASCII-8BIT encoding
binary_data.encoding
# => #<Encoding:ASCII-8BIT>
# Character operations don't work as expected
binary_data.upcase # Upcases ASCII bytes only; multibyte characters are left alone
binary_data.length # Counts bytes, not characters
# Correct approach: force proper encoding first
text_data = binary_data.force_encoding("UTF-8")
if text_data.valid_encoding?
  text_data.upcase # Now works correctly
end
Byte Order Marks (BOM) cause problems when present in UTF-8 strings. Many systems add BOM to UTF-8 files, but Ruby treats BOM bytes as regular characters.
# String with UTF-8 BOM
bom_string = "\uFEFF" + "Hello world"
bom_string.length
# => 12 (includes BOM character)
# BOM interferes with string operations
bom_string.start_with?("Hello")
# => false
# Remove BOM for processing
clean_string = bom_string.delete_prefix("\uFEFF")
clean_string.start_with?("Hello")
# => true
# Check for BOM presence
def has_utf8_bom?(str)
  str.encoding == Encoding::UTF_8 && str.start_with?("\uFEFF")
end
Mixing strings with different encodings in operations like concatenation or comparison produces encoding compatibility errors.
utf8_string = "Hello".encode("UTF-8")
ascii_string = "World".encode("US-ASCII")
# This works - ASCII is subset of UTF-8
combined = utf8_string + " " + ascii_string
combined.encoding
# => #<Encoding:UTF-8>
# This fails: the binary string contains non-ASCII bytes
# (an ASCII-only binary string would concatenate without error)
binary_string = "\xFF\xFEdata".force_encoding("ASCII-8BIT")
begin
  result = utf8_string + binary_string
rescue Encoding::CompatibilityError => e
  puts "Cannot mix: #{e.message}"
  # Relabel as UTF-8 - safe only if the bytes really are valid UTF-8
  result = utf8_string + binary_string.force_encoding("UTF-8")
end
Regular expressions inherit encoding from their source, which must be compatible with target strings. Pattern encoding mismatches cause runtime errors.
# UTF-8 pattern
pattern = /café/
utf8_text = "I love café au lait"
# Works fine
matches = utf8_text.scan(pattern)
# Fails with different encoding
latin1_text = utf8_text.encode("ISO-8859-1")
begin
  latin1_matches = latin1_text.scan(pattern)
rescue Encoding::CompatibilityError
  # Convert pattern to match text encoding
  latin1_pattern = Regexp.new(pattern.source.encode("ISO-8859-1"))
  latin1_matches = latin1_text.scan(latin1_pattern)
end
Default external encoding affects file I/O and can cause data corruption if mismatched with actual file encoding. Always specify encoding explicitly when file encoding is known.
# Dangerous - relies on default encoding
content = File.read("japanese.txt") # May corrupt data
# Safe - explicit encoding specification
content = File.read("japanese.txt", encoding: "UTF-8")
# Check default encodings
puts "External: #{Encoding.default_external}"
puts "Internal: #{Encoding.default_internal}"
# Override the process-wide defaults (affects all subsequent I/O)
Encoding.default_external = "UTF-8"
Encoding.default_internal = "UTF-8"
Character normalization differences cause identical-looking strings to compare as unequal. Unicode allows multiple byte representations for the same visual characters.
# Two ways to represent é
composed = "café" # é as single character (U+00E9)
decomposed = "cafe\u0301" # e + combining acute (U+0065 U+0301)
# Look identical but compare as different
composed == decomposed
# => false
composed.length # => 4
decomposed.length # => 5
# Ruby 2.2+ ships built-in normalization: String#unicode_normalize
def normalize_string(str)
  str.unicode_normalize(:nfc) # compose to NFC for reliable comparison
end
normalize_string(decomposed) == composed
# => true
Performance & Memory
UTF-8 provides optimal performance for most text processing in Ruby because it is the default encoding for source code and, in most environments, for external I/O. Operations on UTF-8 strings avoid the conversion overhead that other encodings incur.
require 'benchmark'
text_samples = {
  utf8: "Hello 世界 testing performance with UTF-8",
  utf16: "Hello 世界 testing performance with UTF-16".encode("UTF-16LE"),
  ascii: "Hello world testing performance with ASCII".encode("US-ASCII")
}
# Benchmark string operations
Benchmark.bmbm do |x|
  text_samples.each do |encoding, sample|
    x.report("#{encoding}_upcase") { 10_000.times { sample.upcase } }
    # Regexp matching requires an ASCII-compatible encoding
    next unless sample.encoding.ascii_compatible?
    x.report("#{encoding}_scan") { 10_000.times { sample.scan(/\w+/) } }
  end
end
Memory usage varies significantly between encodings. UTF-32 uses 4 bytes per character regardless of character complexity, while UTF-8 uses 1-4 bytes per character based on Unicode code point range.
# Memory comparison for same text
base_text = "Mixed: ASCII + 中文 + العربية + 🌟"
encodings = ["UTF-8", "UTF-16LE", "UTF-32LE", "ASCII-8BIT"]
memory_usage = {}
encodings.each do |enc|
  encoded = base_text.encode(enc) rescue next
  memory_usage[enc] = {
    characters: encoded.length,
    bytes: encoded.bytesize,
    ratio: encoded.bytesize.to_f / encoded.length
  }
end
memory_usage.each do |enc, stats|
  puts "#{enc}: #{stats[:bytes]} bytes for #{stats[:characters]} chars (#{stats[:ratio].round(1)} bytes/char)"
end
Large text processing benefits from streaming approaches that avoid loading entire files into memory. Process text in chunks while maintaining encoding boundaries.
def process_large_file(filename, chunk_size = 8192)
  # Read in binary mode: IO#read with a length argument always returns
  # ASCII-8BIT chunks, so encoding is restored on complete lines below
  File.open(filename, "rb") do |file|
    buffer = +""
    while chunk = file.read(chunk_size)
      buffer << chunk
      # Splitting on "\n" is safe for UTF-8: byte 0x0A never occurs
      # inside a multi-byte character
      lines = buffer.split("\n", -1)
      buffer = lines.pop || "" # Keep incomplete trailing line
      lines.each do |line|
        yield line.force_encoding("UTF-8") if block_given?
      end
    end
    # Process remaining buffer
    yield buffer.force_encoding("UTF-8") unless buffer.empty?
  end
end
# Usage with large file
process_large_file("large_unicode.txt") do |line|
# Process line without loading entire file
processed = line.upcase.gsub(/\s+/, " ")
end
Encoding conversion performance varies by source and target encodings. UTF-8 to UTF-16 conversion is faster than complex legacy encoding transformations.
# Performance comparison for different conversions
test_text = File.read("sample.txt", encoding: "UTF-8")
conversions = [
  ["UTF-8", "UTF-16LE"],
  ["UTF-8", "ISO-8859-1"],
  ["UTF-8", "Shift_JIS"],
  ["UTF-8", "UTF-32LE"]
]
Benchmark.bmbm do |x|
  conversions.each do |from, to|
    x.report("#{from}_to_#{to}") do
      1000.times { test_text.encode(to, invalid: :replace, undef: :replace) }
    end
  end
end
Concatenating or interpolating strings with incompatible encodings raises Encoding::CompatibilityError rather than converting automatically; only ASCII-only strings mix freely. Pre-converting all strings to a single encoding avoids both the error and repeated conversion overhead.
# Fails - concatenation does not auto-convert
mixed_strings = ["Hello", "世界".encode("UTF-16LE"), "test"]
result = ""
mixed_strings.each { |s| result += s }
# => raises Encoding::CompatibilityError

# Works - convert once up front
utf8_strings = mixed_strings.map { |s| s.encode("UTF-8") }
result = ""
1000.times do
  utf8_strings.each { |s| result += s } # No conversion needed
end
Reference
Core Classes and Methods
Method | Parameters | Returns | Description
---|---|---|---
String#encoding | None | Encoding | Returns string's current encoding
String#encode(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Convert string to different encoding
String#encode!(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | In-place encoding conversion
String#force_encoding(encoding) | encoding (String/Encoding) | String | Change encoding without converting bytes
String#valid_encoding? | None | Boolean | Check if bytes are valid for encoding
String#ascii_only? | None | Boolean | True if all characters are ASCII
String#bytes | None | Array<Integer> | Array of byte values
String#codepoints | None | Array<Integer> | Array of Unicode code points
String#bytesize | None | Integer | Number of bytes in string
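These inspection methods compose well; one string exercises most of them (a small illustrative sketch):

```ruby
s = "héllo"
s.encoding         # => #<Encoding:UTF-8>
s.ascii_only?      # => false
s.length           # => 5 characters
s.bytesize         # => 6 bytes (é takes two)
s.bytes.first(3)   # => [104, 195, 169] (h, then é's two UTF-8 bytes)
s.codepoints       # => [104, 233, 108, 108, 111]
s.valid_encoding?  # => true
```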
Encoding Class Methods
Method | Parameters | Returns | Description
---|---|---|---
Encoding.list | None | Array<Encoding> | All available encodings
Encoding.find(name) | name (String) | Encoding | Find encoding by name
Encoding.compatible?(str1, str2) | str1, str2 (String) | Encoding or nil | Common encoding or nil
Encoding.default_external | None | Encoding | Default for file I/O
Encoding.default_internal | None | Encoding or nil | Default conversion target
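A few of these in action (the default_external value depends on the environment, so it is printed rather than asserted):

```ruby
Encoding.find("utf-8")                  # => #<Encoding:UTF-8> (lookup is case-insensitive)
Encoding.compatible?("abc", "世界")      # => #<Encoding:UTF-8> (ASCII-only mixes freely)
Encoding.compatible?("世界", "\xFF".b)   # => nil (no common encoding)
puts Encoding.default_external          # typically UTF-8
```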
Encoding Conversion Options
Option | Values | Description
---|---|---
:invalid | :replace, :ignore | Handle invalid byte sequences
:undef | :replace, :ignore | Handle undefined character conversions
:replace | String | Replacement string for problematic characters
:fallback | Hash or Proc | Custom character mapping
:xml | :text, :attr | XML-specific escaping rules
:cr_newline | Boolean | Convert LF to CR
:crlf_newline | Boolean | Convert LF to CRLF
:universal_newline | Boolean | Convert CRLF and CR to LF
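The less common options are easiest to understand by example; a short sketch:

```ruby
# :fallback supplies replacements for otherwise unconvertible characters
"café".encode("US-ASCII", fallback: { "é" => "e" })
# => "cafe"

# :xml escapes markup-significant characters during conversion
"<b> & co".encode("UTF-8", xml: :text)
# => "&lt;b&gt; &amp; co"

# :universal_newline converts CRLF and CR to LF
"one\r\ntwo\rthree".encode(universal_newline: true)
# => "one\ntwo\nthree"
```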
Common Encodings
Encoding | Description | Bytes per Character | Use Case
---|---|---|---
UTF-8 | Unicode, 8-bit units | 1-4 | Default, web content, modern files
UTF-16LE | Unicode, 16-bit little-endian | 2 or 4 | Windows systems, some databases
UTF-32LE | Unicode, 32-bit little-endian | 4 | Fixed-width Unicode processing
US-ASCII | 7-bit ASCII | 1 | Legacy systems, protocol headers
ASCII-8BIT | Binary data (alias BINARY) | 1 | Binary files, network protocols
ISO-8859-1 | Latin-1 | 1 | Legacy European text
Windows-1252 | Windows Latin | 1 | Windows legacy files
Shift_JIS | Japanese | 1-2 | Japanese legacy systems
Exception Classes
Exception | Trigger | Common Cause
---|---|---
Encoding::InvalidByteSequenceError | Invalid bytes for source encoding | Corrupted data, wrong encoding
Encoding::UndefinedConversionError | Character not in target encoding | Unicode to ASCII conversion
Encoding::CompatibilityError | Incompatible encodings mixed | Mixed encoding operations
Encoding::ConverterNotFoundError | Unknown encoding conversion | Typo in encoding name
Regular Expression Encoding Flags
Flag | Description
---|---
//u | UTF-8 encoding
//e | EUC-JP encoding
//s | Windows-31J (Shift_JIS) encoding
//n | ASCII-8BIT encoding
File I/O Encoding Options
# Reading with encoding
File.read("file.txt", encoding: "UTF-8")
File.read("file.txt", encoding: "UTF-8:UTF-16LE") # external:internal
# Writing with encoding
File.write("file.txt", data, encoding: "UTF-8")
# Open with encoding
File.open("file.txt", "r:UTF-8:UTF-16LE") do |f|
  # Bytes read as UTF-8, transcoded to UTF-16LE
end