Overview
Ruby handles text encoding through the Encoding class and String methods that convert character data between different encoding formats. The encoding system operates on byte sequences, transforming them from one character representation to another while preserving the textual meaning.
String objects in Ruby carry encoding information internally. The String#encoding method returns the current encoding, while String#encode performs the actual conversion. Ruby supports over 100 encodings, including UTF-8, ASCII, ISO-8859-1, Windows-1252, and various Asian encodings like Shift_JIS and EUC-JP.
text = "Hello, 世界"
text.encoding
# => #<Encoding:UTF-8>
text.encode('ISO-8859-1', undef: :replace)
# => "Hello, ??" (one "?" per character that ISO-8859-1 cannot represent)
The encoding conversion process involves three key components: the source encoding (current string encoding), the target encoding (desired output encoding), and conversion options that control error handling and replacement characters.
# Check available encodings (entries after the first three vary by Ruby version)
Encoding.list.first(3)
# => [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>, #<Encoding:US-ASCII>]
# Get encoding by name
Encoding.find('UTF-8')
# => #<Encoding:UTF-8>
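The three components named above map directly onto the arguments of String#encode. As a minimal sketch (the variable names are illustrative and not part of any API), the second positional argument states the source encoding explicitly, which helps when the bytes do not yet carry the right label:

```ruby
# Latin-1 byte for "é", held as a binary string (illustrative input)
raw = "caf\xE9".b

# encode(target, source, **options): read the bytes as ISO-8859-1, transcode to UTF-8
utf8 = raw.encode('UTF-8', 'ISO-8859-1')
utf8.bytes # => [99, 97, 102, 195, 169]

# The options control what happens to characters the target cannot represent
ascii = utf8.encode('US-ASCII', undef: :replace, replace: '?')
ascii # => "caf?"
```

Passing the source encoding this way avoids mutating the input with force_encoding first.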
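Ruby's encoding system distinguishes between compatible and incompatible encodings. Two strings are compatible when Ruby can combine them (for example, concatenate them) without transcoding, which is the case when they share an encoding or when one of them contains only ASCII characters in an ASCII-compatible encoding. Encoding.compatible? returns the encoding the combined result would have, or nil when the two are incompatible. Conversions to an encoding that cannot represent every source character are a separate concern: they may result in data loss or require replacement strategies.
ascii_text = "Hello"
utf8_text = ascii_text.encode('UTF-8')
Encoding.compatible?(ascii_text, utf8_text)
# => #<Encoding:UTF-8>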
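To make the incompatible case concrete, here is a small sketch that compares strings with and without non-ASCII content. The variable names are illustrative only:

```ruby
utf8   = "café"                        # UTF-8 literal with a non-ASCII character
latin1 = "café".encode('ISO-8859-1')   # same text, different bytes and label
ascii  = "cafe".encode('US-ASCII')     # ASCII-only content

both_utf8      = Encoding.compatible?(utf8, "naïve") # => #<Encoding:UTF-8>
ascii_and_utf8 = Encoding.compatible?(ascii, utf8)   # => #<Encoding:UTF-8>
mixed          = Encoding.compatible?(utf8, latin1)  # => nil
```

A nil result is the signal that one side has to be transcoded with encode before the strings can be combined.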
Basic Usage
The primary method for encoding conversion is String#encode, which accepts a target encoding and optional parameters. The method returns a new string with the specified encoding, leaving the original string unchanged.
original = "Café"
converted = original.encode('ASCII', undef: :replace, replace: '?')
# => "Caf?"
original.encoding
# => #<Encoding:UTF-8>
converted.encoding
# => #<Encoding:US-ASCII>
Ruby accepts encoding specifications as strings, symbols, or Encoding objects. The conversion behavior changes based on whether characters exist in the target encoding.
text = "München"
# String notation
text.encode('ISO-8859-1')
# => "München"
# Symbol notation
text.encode(:ascii, undef: :replace)
# => "M?nchen"
# Encoding object (Ruby's UTF-16 output is big-endian with a leading byte order mark)
text.encode(Encoding::UTF_16).bytes.first(6)
# => [254, 255, 0, 77, 0, 252]
The force_encoding method changes the encoding label without converting bytes, while encode performs actual character transformation. This distinction affects how Ruby interprets byte sequences. Note that force_encoding modifies the receiver in place and returns it.
bytes = "\xC3\xBC".b # UTF-8 bytes for 'ü', held as a binary string
bytes.encoding
# => #<Encoding:ASCII-8BIT>
# Force encoding interpretation
utf8_string = bytes.force_encoding('UTF-8')
utf8_string
# => "ü"
# Convert to different encoding
latin1_string = utf8_string.encode('ISO-8859-1')
latin1_string.bytes
# => [252]
Conversion options control how Ruby handles characters that don't exist in the target encoding. The undef parameter specifies behavior for undefined characters, while replace sets the replacement string.
japanese = "こんにちは"
# Default: raise exception
japanese.encode('ASCII') rescue 'error'
# => "error"
# Replace undefined characters
japanese.encode('ASCII', undef: :replace, replace: '[?]')
# => "[?][?][?][?][?]"
# Skip undefined characters
japanese.encode('ASCII', undef: :replace, replace: '')
# => ""
Error Handling & Debugging
Encoding conversions raise specific exceptions when characters cannot be represented in the target encoding or when byte sequences are invalid. Ruby provides several exception classes for different encoding problems.
Encoding::UndefinedConversionError occurs when source characters don't exist in the target encoding. The exception exposes the offending character along with the source and destination encodings.
begin
  "Naïve".encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  puts "Character: #{e.error_char.inspect}"
  puts "Source encoding: #{e.source_encoding}"
  puts "Destination encoding: #{e.destination_encoding}"
end
# Character: "ï"
# Source encoding: UTF-8
# Destination encoding: US-ASCII
Encoding::InvalidByteSequenceError is raised when the source string contains byte sequences that are invalid for its declared encoding. This commonly happens with binary data incorrectly labeled as text.
binary_data = "\xFF\xFE\x00\x41".dup
binary_data.force_encoding('UTF-8')
begin
  binary_data.encode('ASCII')
rescue Encoding::InvalidByteSequenceError => e
  puts "Error bytes: #{e.error_bytes.unpack1('H*')}"
  puts "Source encoding: #{e.source_encoding_name}"
end
# Error bytes: ff
# Source encoding: UTF-8
The valid_encoding? method checks if a string's bytes match its declared encoding without raising exceptions. This allows pre-validation before conversion attempts.
def safe_encode(text, target_encoding)
  unless text.valid_encoding?
    return nil, "Invalid byte sequence for #{text.encoding}"
  end
  begin
    converted = text.encode(target_encoding)
    [converted, nil]
  rescue Encoding::UndefinedConversionError => e
    [nil, "Cannot convert '#{e.error_char}' to #{target_encoding}"]
  rescue Encoding::InvalidByteSequenceError => e
    [nil, "Invalid byte sequence: #{e.error_bytes.unpack1('H*')}"]
  end
end
result, error = safe_encode("Café", 'ASCII')
# => [nil, "Cannot convert 'é' to ASCII"]
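When validation fails, there are two common ways forward. The sketch below assumes an input whose bytes are really Latin-1 but carry a UTF-8 label; the variable names are illustrative:

```ruby
# Bytes that are Latin-1 text but carry a UTF-8 label (illustrative input)
raw = "caf\xE9 ok".dup.force_encoding('UTF-8')

valid_before = raw.valid_encoding? # => false (\xE9 does not start a complete UTF-8 sequence here)

# Option 1: keep the label and replace the invalid bytes
scrubbed = raw.scrub('?') # => "caf? ok"

# Option 2: if the real encoding is known, relabel a copy and transcode it
recovered = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8') # => "café ok"
```

Option 1 loses the character but always yields a valid string. Option 2 preserves the text, but only if the guess about the real encoding is right.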
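Ruby also supports custom fallback mappings for characters that the target encoding cannot represent. The fallback option accepts a hash (or a proc) mapping each problematic character to a replacement. A character with no mapping still raises Encoding::UndefinedConversionError.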
# Custom character replacements
replacements = {
  'é' => 'e',
  'ñ' => 'n',
  '©' => '(c)'
}
"Café ñoño ©2023".encode('ASCII', fallback: replacements)
# => "Cafe nono (c)2023"
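A hash works when the set of problem characters is known in advance. For open-ended input, a callable can compute the replacement instead. The following is only a sketch: the lambda name and the \uXXXX escape format are assumptions chosen for illustration.

```ruby
# Hypothetical fallback: write each unmappable character as a \uXXXX escape
escape_unmappable = ->(char) { format('\u%04X', char.ord) }

escaped = "naïve ☃".encode('US-ASCII', fallback: escape_unmappable)
# escaped is now the ASCII string: na\u00EFve \u2603
```

Unlike replace: '?', this keeps enough information to reconstruct the original characters later.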
Debugging encoding issues requires examining byte sequences and encoding labels separately. String#inspect shows the content with invalid bytes escaped (for example \xFF) but does not report the encoding, so print the encoding and the raw bytes alongside it.
def debug_encoding(str)
  puts "String: #{str.inspect}"
  puts "Encoding: #{str.encoding}"
  puts "Bytes: #{str.bytes.map { |b| sprintf('%02X', b) }.join(' ')}"
  puts "Valid: #{str.valid_encoding?}"
  puts "ASCII only: #{str.ascii_only?}"
end
debug_encoding("test\xFF".force_encoding('UTF-8'))
# String: "test\xFF"
# Encoding: UTF-8
# Bytes: 74 65 73 74 FF
# Valid: false
# ASCII only: false
Performance & Memory
Encoding conversions allocate a new string and run a transcoding step whose cost depends on the source encoding, the target encoding, and the text itself. As a rule of thumb, ASCII-only text converts cheaply between ASCII-compatible encodings, while transcoding between different multi-byte formats requires more processing.
Single-byte to single-byte conversions tend to be cheapest, because each byte maps to one character and no character boundary detection is needed. Multi-byte conversions require parsing character boundaries and may involve lookup tables for character mapping. Treat these as expectations to confirm with a benchmark on your own data rather than as guarantees.
require 'benchmark'
ascii_text = "Hello World" * 1000
utf8_text = "Hello 世界" * 1000
# 'large_file.txt' is a placeholder; point it at any UTF-8 file you have
complex_text = File.read('large_file.txt', encoding: 'UTF-8')
Benchmark.bmbm do |x|
  x.report("ASCII to Latin1") { ascii_text.encode('ISO-8859-1') }
  x.report("UTF-8 to UTF-16") { utf8_text.encode('UTF-16') }
  x.report("UTF-8 to ASCII") { complex_text.encode('ASCII', undef: :replace) }
end
Memory usage patterns differ between encoding types. Converting ASCII text to UTF-16 roughly doubles its byte size, while UTF-8 to ASCII conversions may reduce memory usage because each multi-byte character is replaced by a single-byte substitute.
original = "Mixed text with émojis 🎉" * 10000
# Memory efficient: same character count, different byte representation
ascii_version = original.encode('ASCII', undef: :replace, replace: '?')
puts "Original bytes: #{original.bytesize}"
puts "ASCII bytes: #{ascii_version.bytesize}"
puts "Character count: #{original.length} vs #{ascii_version.length}"
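The size differences are easy to check directly. This sketch compares byte sizes for two short sample strings; the sample texts and hash layout are illustrative choices:

```ruby
samples = { 'ascii' => 'Hello', 'cjk' => '世界' }

# Byte size of each sample in three Unicode encodings
sizes = samples.transform_values do |text|
  {
    'UTF-8'    => text.bytesize,
    'UTF-16LE' => text.encode('UTF-16LE').bytesize,
    'UTF-32LE' => text.encode('UTF-32LE').bytesize
  }
end

sizes.each { |label, by_encoding| puts "#{label}: #{by_encoding}" }
# ascii needs 5, 10 and 20 bytes; cjk needs 6, 4 and 8 bytes
```

UTF-16 doubles the ASCII sample but is smaller than UTF-8 for the CJK sample, so the cheaper encoding depends on the character mix.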
Large file processing benefits from streaming conversions rather than loading entire files into memory. Ruby's IO methods accept encoding parameters for automatic conversion during read operations.
# Memory efficient file processing
def convert_large_file(input_path, output_path, target_encoding)
  File.open(output_path, 'w', encoding: target_encoding) do |output|
    File.open(input_path, 'r', encoding: 'UTF-8') do |input|
      input.each_line do |line|
        begin
          converted_line = line.encode(target_encoding, undef: :replace)
          output.write(converted_line)
        rescue Encoding::InvalidByteSequenceError
          # Skip invalid lines or handle specifically
          next
        end
      end
    end
  end
end
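The automatic conversion mentioned above can also be requested through the file mode string, in the form external:internal. The sketch below is one possible arrangement, not the only one; the helper name and the temporary files are assumptions made so the example is self-contained:

```ruby
require 'tempfile'

# Hypothetical helper: read Latin-1 and let the IO layer hand back UTF-8 strings
def latin1_file_to_utf8(input_path, output_path)
  File.open(input_path, 'r:ISO-8859-1:UTF-8') do |input|
    File.open(output_path, 'w:UTF-8') do |output|
      # Only one line is held in memory at a time
      input.each_line { |line| output.write(line) }
    end
  end
end

# Create a small Latin-1 input file for the demonstration
source = Tempfile.new('latin1')
source.binmode
source.write("caf\xE9\n".b)
source.close

target = Tempfile.new('utf8')
target.close

latin1_file_to_utf8(source.path, target.path)
result_bytes = File.binread(target.path).bytes
```

Here the conversion happens during reading, so no explicit encode call is needed inside the loop.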
Conversion performance varies with character distribution. Text containing only ASCII characters converts quickly to any ASCII-compatible encoding, while text with many Unicode characters requires more processing time.
require 'benchmark'
def performance_comparison(text, iterations = 10000)
  encodings = ['ASCII', 'ISO-8859-1', 'UTF-16', 'Shift_JIS']
  encodings.each do |target|
    time = Benchmark.realtime do
      iterations.times do
        text.encode(target, undef: :replace) rescue nil
      end
    end
    puts "#{target}: #{(time * 1000).round(2)}ms for #{iterations} conversions"
  end
end
performance_comparison("Hello World") # ASCII only
performance_comparison("こんにちは") # Multi-byte characters
Common Pitfalls
The force_encoding method changes encoding labels without converting bytes, leading to corrupted text when used incorrectly. Developers often confuse this with actual encoding conversion.
# WRONG: relabeling UTF-8 bytes as ASCII leaves an invalid string
utf8_bytes = "Café".bytes # [67, 97, 102, 195, 169]
binary_string = utf8_bytes.pack('C*').force_encoding('ASCII')
binary_string
# => "Caf\xC3\xA9"
# CORRECT: label the bytes with their real encoding, then use encode to convert
binary_string.force_encoding('UTF-8').encode('ASCII', undef: :replace)
# => "Caf?"
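A related symptom of mislabeling is double-encoded text. The sketch below reproduces it and then reverses it; the variable names are illustrative:

```ruby
original = "Café"

# Mistake: the UTF-8 bytes get a Latin-1 label and are then "converted" to UTF-8
mislabeled = original.dup.force_encoding('ISO-8859-1')
garbled    = mislabeled.encode('UTF-8') # => "CafÃ©"

# Undo: transcode back to the Latin-1 bytes, then restore the correct label
repaired = garbled.encode('ISO-8859-1').force_encoding('UTF-8') # => "Café"
```

The repair only works while no lossy step (such as undef: :replace) has been applied in between.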
Binary data mixed with text causes encoding errors when Ruby interprets non-text bytes as characters. This commonly occurs when reading files that contain both text and binary sections.
# File contains text followed by binary data
mixed_content = "Header: Info\n\x00\x01\x02\xFF\xFE"
# This will fail
mixed_content.force_encoding('UTF-8').encode('ASCII') rescue 'encoding error'
# => "encoding error"
# Solution: separate text and binary portions at the byte level, then relabel the text
text_portion = mixed_content.b.split("\x00", 2).first.force_encoding('UTF-8')
text_portion.encode('ASCII', undef: :replace)
# => "Header: Info\n"
Default encoding assumptions cause problems when system default encodings differ from file encodings. Ruby uses system locale settings that may not match file content.
# System default might be ASCII
Encoding.default_external
# => #<Encoding:US-ASCII>
# But file contains UTF-8
# This creates encoding mismatches
file_content = File.read('utf8_file.txt') # Assumes system default
file_content.valid_encoding?
# => false
# Always specify encoding explicitly
file_content = File.read('utf8_file.txt', encoding: 'UTF-8')
file_content.valid_encoding?
# => true
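The effect of the label can be reproduced without changing any system settings. This sketch writes UTF-8 bytes to a temporary file (an assumption made so the example is self-contained) and reads them back under two labels:

```ruby
require 'tempfile'

file = Tempfile.new('utf8_sample')
file.binmode
file.write("café\n".b) # UTF-8 bytes on disk
file.close

as_ascii = File.read(file.path, encoding: 'US-ASCII')
as_utf8  = File.read(file.path, encoding: 'UTF-8')

ascii_valid = as_ascii.valid_encoding?        # => false
utf8_valid  = as_utf8.valid_encoding?         # => true
same_bytes  = as_ascii.bytes == as_utf8.bytes # => true
```

The bytes are identical in both reads; only the label differs, which is why the fix is to state the encoding rather than to convert anything.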
Encoding conversion loses information when target encodings cannot represent source characters. This data loss is permanent and cannot be reversed without the original text.
original = "Résumé with naïve café"
ascii_version = original.encode('ASCII', undef: :replace, replace: '?')
# => "R?sum? with na?ve caf?"
# Cannot recover original characters
ascii_version.encode('UTF-8')
# => "R?sum? with na?ve caf?" # Still has question marks
String concatenation between different encodings follows complex compatibility rules that can produce unexpected results or raise exceptions.
ascii_str = "Hello".encode('ASCII')
utf8_str = " 世界".encode('UTF-8')
# This works: ASCII is UTF-8 compatible
result = ascii_str + utf8_str
result.encoding
# => #<Encoding:UTF-8>
# This fails: incompatible encodings
latin1_str = "café".encode('ISO-8859-1')
utf16_str = "world".encode('UTF-16')
latin1_str + utf16_str rescue 'incompatible encodings'
# => "incompatible encodings"
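One way to avoid the failure is to transcode every part to a single encoding before combining. A minimal sketch, with illustrative variable names and UTF-8 chosen as the common encoding:

```ruby
latin1_str  = "café".encode('ISO-8859-1')
utf16le_str = " world".encode('UTF-16LE')

# Direct concatenation would raise Encoding::CompatibilityError,
# so transcode every part to one encoding first
parts    = [latin1_str, utf16le_str]
combined = parts.map { |part| part.encode('UTF-8') }.join

combined          # => "café world"
combined.encoding # => #<Encoding:UTF-8>
```

Normalizing at the boundary where data enters the program usually keeps this concern out of the rest of the code.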
Regular expressions operate on encoding-aware character boundaries, not byte boundaries. A regexp literal that contains non-ASCII characters has a fixed encoding, and matching it against a non-ASCII string in a different encoding raises Encoding::CompatibilityError.
# Pattern is UTF-8, but string is Latin1
pattern = /café/
latin1_string = "café".encode('ISO-8859-1')
# This raises instead of matching
pattern.match(latin1_string) rescue 'incompatible encoding regexp match'
# => "incompatible encoding regexp match"
# Convert to matching encoding first
pattern.match(latin1_string.encode('UTF-8'))
# => #<MatchData "café">
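When transcoding the subject string is not desirable, the pattern can be built in the subject's encoding instead. This is a sketch of that alternative; the variable names are illustrative:

```ruby
latin1_string = "un café noir".encode('ISO-8859-1')

# Build the pattern from a string in the same encoding as the subject
needle  = "café".encode('ISO-8859-1')
pattern = Regexp.new(Regexp.escape(needle))

pattern.encoding                        # => #<Encoding:ISO-8859-1>
matched = pattern.match?(latin1_string) # => true
```

Regexp.escape is used so that any metacharacters in the search text are matched literally.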
Reference
Core Classes and Methods
Class/Method | Purpose | Returns |
---|---|---|
Encoding | Represents character encoding | Encoding object |
String#encoding | Get string's encoding | Encoding object |
String#encode(encoding, **opts) | Convert to target encoding | New string |
String#force_encoding(encoding) | Change encoding label | Self (modified) |
String#valid_encoding? | Check encoding validity | Boolean |
String#ascii_only? | Check if ASCII characters only | Boolean |
Encoding Detection and Conversion Options
Method | Parameters | Returns | Description |
---|---|---|---|
Encoding.find(name) | name (String/Symbol) | Encoding | Get encoding by name |
Encoding.list | None | Array<Encoding> | All available encodings |
Encoding.compatible?(str1, str2) | Two objects | Encoding or nil | Common compatible encoding |
String#encode(to, **opts) | to (Encoding), options (Hash) | String | Convert encoding with options |
Conversion Options
Option | Values | Default | Description |
---|---|---|---|
undef | nil , :replace | nil (raise) | Undefined character handling |
invalid | nil , :replace | nil (raise) | Invalid byte sequence handling |
replace | String | "?" (U+FFFD for Unicode targets) | Replacement character/string |
fallback | Hash, Proc, Method | nil | Custom character mappings |
xml | :text , :attr | nil | XML entity encoding |
cr_newline | Boolean | false | Convert LF to CR |
crlf_newline | Boolean | false | Convert LF to CRLF |
universal_newline | Boolean | false | Convert CRLF and CR to LF |
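The newline and xml options in the table are less commonly seen than undef and replace, so a short sketch may help. The sample strings are illustrative:

```ruby
# Normalize mixed line endings to LF
mixed_newlines = "one\r\ntwo\rthree\n"
normalized = mixed_newlines.encode('UTF-8', universal_newline: true)
# normalized == "one\ntwo\nthree\n"

# Expand LF to CRLF, for example before writing a Windows-style file
crlf = "one\ntwo\n".encode('UTF-8', crlf_newline: true)
# crlf == "one\r\ntwo\r\n"

# Escape markup characters; unmappable characters become character references
xml_text = "<あ & b>".encode('US-ASCII', xml: :text)
# xml_text == "&lt;&#x3042; &amp; b&gt;"
```

These options are applied as part of the same encode call, so they combine with a change of encoding as in the last line.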
Exception Hierarchy
EncodingError
├── Encoding::UndefinedConversionError
├── Encoding::InvalidByteSequenceError
├── Encoding::ConverterNotFoundError
└── Encoding::CompatibilityError
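Because all four classes share EncodingError as their parent, a single rescue clause can cover them. The helper below is hypothetical and exists only to show which class each failure produces:

```ruby
# Hypothetical helper: run a block and report which encoding failure occurred
def classify_encoding_failure
  yield
  :ok
rescue EncodingError => e
  e.class
end

undefined = classify_encoding_failure { "é".encode('US-ASCII') }
invalid   = classify_encoding_failure { "\xFF".dup.force_encoding('UTF-8').encode('UTF-16LE') }
missing   = classify_encoding_failure { "abc".encode('NO-SUCH-ENCODING') }
mismatch  = classify_encoding_failure { "é".encode('ISO-8859-1') + "é" }
success   = classify_encoding_failure { "abc".encode('US-ASCII') }
```

Rescuing the parent class is convenient for logging. Rescuing the specific subclasses remains preferable when each case needs a different recovery.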
Common Encodings
Encoding Name | Aliases | Character Set | Use Case |
---|---|---|---|
UTF-8 | None | Unicode | Web, modern text |
ASCII | US-ASCII | 7-bit ASCII | Legacy systems |
ISO-8859-1 | Latin1 | Latin alphabet | Western European |
Windows-1252 | CP1252 | Extended Latin | Windows systems |
UTF-16 | None | Unicode | Windows APIs |
Shift_JIS | SJIS | Japanese | Japanese systems |
EUC-JP | None | Japanese | Unix Japanese |
ASCII-8BIT | BINARY | Raw bytes | Binary data |
String Method Quick Reference
# Encoding information
str.encoding # Current encoding
str.valid_encoding? # Validity check
str.ascii_only? # ASCII character check
# Encoding conversion
str.encode(target) # Basic conversion
str.encode(target, undef: :replace) # Replace undefined chars
str.encode(target, invalid: :replace) # Replace invalid bytes
str.force_encoding(encoding) # Change label only
# Byte operations
str.bytes # Array of byte values
str.bytesize # Size in bytes
str.length # Size in characters
Error Handling Patterns
# Exception handling
begin
  converted = str.encode(target)
rescue Encoding::UndefinedConversionError
  # Handle undefined characters (and any invalid bytes that follow them)
  converted = str.encode(target, undef: :replace, invalid: :replace)
rescue Encoding::InvalidByteSequenceError
  # Handle invalid byte sequences
  str = str.scrub # Replace invalid sequences (U+FFFD by default for Unicode strings)
  retry
end
# Validation before conversion
if str.valid_encoding?
  converted = str.encode(target, undef: :replace)
else
  cleaned = str.scrub
  converted = cleaned.encode(target, undef: :replace)
end
File I/O with Encodings
# Read with specific encoding
File.read('file.txt', encoding: 'UTF-8')
# Write with encoding conversion
File.write('output.txt', content, encoding: 'ISO-8859-1')
# Open with encoding specification
File.open('file.txt', 'r:UTF-8:ASCII') do |f|
  # External encoding is UTF-8; data is transcoded to ASCII as it is read
end