Overview
String encoding in Ruby controls how text characters map to byte sequences in memory and storage. Ruby represents strings as sequences of bytes combined with an encoding object that defines how to interpret those bytes as characters. Every string carries encoding metadata that determines character boundaries, case conversion rules, and validity constraints.
The `Encoding` class provides the foundation for all encoding operations. Ruby ships with over 100 built-in encodings, from ASCII and UTF-8 to legacy formats such as Windows-1252 and ISO-8859-1. The `String#encoding` method returns a string's encoding object, while `String#encode` converts between encodings.
```ruby
str = "Hello, 世界"
str.encoding  # => #<Encoding:UTF-8>
str.bytesize  # => 13 (bytes in memory)
str.size      # => 9 (logical characters)
```
Ruby strings exist in one of three validity states. Valid strings contain only byte sequences that form complete characters in their declared encoding. Broken strings contain invalid byte sequences but Ruby can still perform many operations on them. Incomplete strings end with a partial character sequence.
```ruby
# Valid UTF-8
valid = "café".force_encoding('UTF-8')
valid.valid_encoding?  # => true

# Broken encoding - invalid UTF-8 bytes
broken = "\xFF\xFE".force_encoding('UTF-8')
broken.valid_encoding?  # => false

# Many string operations still work on broken strings
broken.upcase  # => "\xFF\xFE"
```
The default source encoding for Ruby files is UTF-8. Ruby automatically handles encoding for string literals, but external data requires explicit encoding management. File operations, network communication, and database interactions all involve encoding decisions that affect correctness and performance.
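These I/O encoding decisions can be made explicit with Ruby's mode-string syntax, where `'r:EXTERNAL:INTERNAL'` names the file's encoding and the encoding to transcode into on read. A small sketch (the temp file and its Latin-1 content are illustrative):

```ruby
require 'tempfile'

content = nil
Tempfile.create('latin1-sample') do |f|
  f.binmode
  f.write("caf\xE9".b)  # 0xE9 is "é" in ISO-8859-1
  f.flush
  # External encoding ISO-8859-1, internal UTF-8: transcode on read
  content = File.open(f.path, 'r:ISO-8859-1:UTF-8', &:read)
end

content.encoding  # => #<Encoding:UTF-8>
content           # => "café"
```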
Basic Usage
The primary encoding operations involve checking the current encoding, converting between encodings, and forcing an encoding interpretation. The `String#encoding` method returns the encoding object, while `String#force_encoding` changes the encoding label without converting bytes.
```ruby
str = "résumé"
str.encoding  # => #<Encoding:UTF-8>
str.force_encoding('US-ASCII')  # Change label only
str.encoding  # => #<Encoding:US-ASCII>
str.valid_encoding?  # => false (the bytes of "é" are invalid in ASCII)
```
The `String#encode` method converts byte sequences from one encoding to another. Unlike `force_encoding`, it actually transforms the underlying bytes to match the target encoding's representation.
```ruby
utf8_str = "café"  # UTF-8 source
latin1_str = utf8_str.encode('ISO-8859-1')
latin1_str.encoding  # => #<Encoding:ISO-8859-1>
latin1_str.bytes     # => [99, 97, 102, 233] (different bytes)

# Converting back
utf8_again = latin1_str.encode('UTF-8')
utf8_again == utf8_str  # => true
```
The Encoding.default_external
and Encoding.default_internal
settings control automatic conversion during I/O operations. Ruby applies these defaults when reading files or network data without explicit encoding specification.
```ruby
# Check current defaults
Encoding.default_external  # => #<Encoding:UTF-8>
Encoding.default_internal  # => nil

# File operations use these defaults
File.write('test.txt', 'café')
content = File.read('test.txt')
content.encoding  # => #<Encoding:UTF-8>
```
String concatenation between different encodings follows specific compatibility rules. ASCII-compatible encodings can combine with ASCII-only strings, but incompatible encodings raise `Encoding::CompatibilityError`.
```ruby
utf8_str = "café"   # UTF-8
ascii_str = "shop"  # ASCII-only content (compatible with UTF-8)
result = utf8_str + ascii_str  # Works fine
result.encoding  # => #<Encoding:UTF-8>

# Incompatible encodings
utf16_str = "test".encode('UTF-16')
utf8_str + utf16_str  # => raises Encoding::CompatibilityError
```
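A quick compatibility check is possible before concatenating: `Encoding.compatible?` returns the encoding the combined string would have, or `nil` when the operation would raise. A short sketch:

```ruby
utf8_str  = "café"
ascii_str = "shop".encode('US-ASCII')
utf16_str = "test".encode('UTF-16')

Encoding.compatible?(utf8_str, ascii_str)  # => #<Encoding:UTF-8>
Encoding.compatible?(utf8_str, utf16_str)  # => nil
```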
Pattern matching with regular expressions requires encoding compatibility between the pattern and target string. Ruby automatically promotes ASCII patterns to match the string's encoding when possible.
```ruby
utf8_text = "The café is open"
ascii_pattern = /caf/  # ASCII-only pattern
utf8_text =~ ascii_pattern  # => 4 (works due to ASCII compatibility)

# Unicode character classes work with proper encoding
utf8_text.scan(/\p{L}+/)  # => ["The", "café", "is", "open"]
```
Error Handling & Debugging
Encoding errors fall into several categories, each requiring a different handling strategy. `Encoding::UndefinedConversionError` occurs when the target encoding cannot represent specific characters. `Encoding::InvalidByteSequenceError` occurs when source bytes are invalid for their declared encoding.
```ruby
# UndefinedConversionError - character doesn't exist in the target encoding
utf8_str = "café"
begin
  ascii_str = utf8_str.encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  puts "Cannot convert: #{e.error_char}"            # => "Cannot convert: é"
  puts "Source encoding: #{e.source_encoding}"      # => UTF-8
  puts "Target encoding: #{e.destination_encoding}" # => US-ASCII
end
```
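The companion error can be caught the same way; `Encoding::InvalidByteSequenceError` exposes the offending bytes through `error_bytes`:

```ruby
# InvalidByteSequenceError - source bytes are invalid for their encoding
caught = nil
begin
  "caf\xFF".force_encoding('UTF-8').encode('ISO-8859-1')
rescue Encoding::InvalidByteSequenceError => e
  caught = e
  puts "Invalid bytes: #{e.error_bytes.inspect}"  # => "\xFF"
  puts "Source encoding: #{e.source_encoding}"    # => UTF-8
end
```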
The `encode` method accepts options for handling conversion errors gracefully. The `:invalid` and `:undef` options specify replacement strategies, while `:replace` sets the replacement string.
```ruby
problematic = "café\xFF"  # Mixed valid UTF-8 and an invalid byte

# Handle both undefined conversion and invalid bytes
safe_ascii = problematic.encode('ASCII',
  invalid: :replace,  # Replace invalid bytes
  undef:   :replace,  # Replace undefined characters
  replace: '?')       # Use '?' as the replacement
# => "caf??"

# XML-safe replacement using a numeric character reference
# (the replacement string itself must be representable in ASCII)
xml_safe = problematic.encode('ASCII',
  invalid: :replace,
  undef:   :replace,
  replace: '&#65533;')  # Entity for U+FFFD, the Unicode replacement character
```
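For XML output specifically, the `:xml` option (listed under Conversion Options in the reference tables) automates this: it escapes markup characters and emits numeric character references for characters the target encoding cannot represent. A small sketch:

```ruby
# :xml => :text escapes &, <, > and emits hexadecimal character
# references for characters undefined in the target encoding
"résumé & more".encode('US-ASCII', xml: :text)
# => "r&#xE9;sum&#xE9; &amp; more"
```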
Debugging encoding issues requires examining both the byte-level representation and the logical character structure. The `String#bytes` method reveals the actual byte sequence, while `String#codepoints` shows Unicode values.
```ruby
suspicious = "caf\xE9"  # Suspicious string
suspicious.encoding         # => #<Encoding:UTF-8> (source default)
suspicious.valid_encoding?  # => false

# Examine the bytes
suspicious.bytes  # => [99, 97, 102, 233]
# 233 (0xE9) starts a three-byte UTF-8 sequence,
# but no continuation bytes follow - hence invalid

# Try a different encoding interpretation
latin1_version = suspicious.force_encoding('ISO-8859-1')
latin1_version.valid_encoding?  # => true (0xE9 is "é" in Latin-1)

# Convert to proper UTF-8
fixed = latin1_version.encode('UTF-8')
fixed        # => "café" (correct interpretation)
fixed.bytes  # => [99, 97, 102, 195, 169] (valid UTF-8)
```
The `String#scrub` method removes or replaces invalid byte sequences, providing a robust way to clean untrusted input. It works regardless of which specific encoding errors are present.
```ruby
# String with multiple encoding problems
broken = "good\xFF\xFEbad\x80text"
broken.valid_encoding?  # => false

# Default: each invalid sequence becomes U+FFFD (the replacement character)
broken.scrub  # => "good\uFFFD\uFFFDbad\uFFFDtext"

# Pass an empty string to remove invalid sequences instead
broken.scrub('')  # => "goodbadtext"

# Block form for custom replacement logic
smart_clean = broken.scrub do |invalid_bytes|
  invalid_bytes.unpack1('H*')  # Substitute the hex representation
end
smart_clean  # => "goodfffebad80text"
```
File encoding detection requires heuristic analysis, since most file formats carry no explicit encoding metadata. Detection gems such as `rchardet` provide statistical analysis, but simple patterns often suffice for known data sources.
```ruby
# Simple BOM detection for Unicode files
def detect_encoding(file_path)
  head = File.read(file_path, 3, mode: 'rb') || ''
  # Compare prefixes: a whole-string match would fail when the
  # 3-byte read is longer than a 2-byte BOM
  if head.start_with?("\xEF\xBB\xBF".b)   # UTF-8 BOM
    'UTF-8'
  elsif head.start_with?("\xFF\xFE".b)    # UTF-16LE BOM
    'UTF-16LE'
  elsif head.start_with?("\xFE\xFF".b)    # UTF-16BE BOM
    'UTF-16BE'
  else
    'UTF-8'  # Assume UTF-8 when no BOM is present
  end
end
```
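When a BOM is present, Ruby can also consume it for you: the `'BOM|UTF-8'` mode string strips a leading BOM during the read. A sketch (the temp file is illustrative):

```ruby
require 'tempfile'

content = nil
Tempfile.create('bom-sample') do |f|
  f.binmode
  f.write("\xEF\xBB\xBF".b)  # UTF-8 BOM
  f.write('café')            # body text (UTF-8 bytes)
  f.flush
  content = File.read(f.path, mode: 'rb:BOM|UTF-8')
end

content           # => "café" (BOM removed)
content.encoding  # => #<Encoding:UTF-8>
```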
Performance & Memory
Encoding conversions carry significant performance costs, especially for large strings. UTF-8 to ASCII conversion requires scanning every byte, while UTF-16 to UTF-8 conversion involves mathematical transformations for each character. Avoiding unnecessary conversions improves both speed and memory usage.
```ruby
# Performance comparison setup
require 'benchmark'

large_text = "café" * 100_000  # ~500 KB UTF-8 string (5 bytes per "café")
iterations = 1000

Benchmark.bm(20) do |x|
  x.report("force_encoding:") do
    # dup first: force_encoding mutates the receiver's label in place
    iterations.times { large_text.dup.force_encoding('ASCII-8BIT') }
  end
  x.report("encode ASCII:") do
    iterations.times { large_text.encode('ASCII', invalid: :replace, undef: :replace) }
  end
  x.report("encode UTF-16:") do
    iterations.times { large_text.encode('UTF-16') }
  end
end

# Results show force_encoding is nearly free (just a metadata change),
# while encode operations transform every byte
```
Memory allocation patterns differ significantly between encoding operations. `force_encoding` changes only metadata and returns the receiver without allocating a new string, while `encode` always creates a new string object, potentially with a different byte length.
```ruby
original = "café" * 10_000
puts "Original: #{original.bytesize} bytes"

# Encoding conversion - new allocation required
utf16_version = original.encode('UTF-16')
puts "UTF-16: #{utf16_version.bytesize} bytes"  # Larger byte count

ascii_version = original.encode('ASCII', invalid: :replace, undef: :replace)
puts "ASCII: #{ascii_version.bytesize} bytes"   # Smaller byte count

# Force encoding - no new allocation; mutates and returns the receiver,
# so it comes last here to avoid breaking the conversions above
forced = original.force_encoding('ASCII-8BIT')
puts "Forced: #{forced.bytesize} bytes"         # Same byte count, same object
```
String operations on encoded text show varying performance characteristics. Character-based operations such as `String#size` require encoding-aware scanning, while byte-based operations such as `String#bytesize` read cached metadata.
```ruby
text = "résumé" * 50_000  # Mixed ASCII/UTF-8 content

Benchmark.bm(15) do |x|
  x.report("bytesize:") do
    100_000.times { text.bytesize }  # O(1) - metadata lookup
  end
  x.report("size:") do
    100_000.times { text.size }  # O(n) - character counting
  end
  x.report("valid?:") do
    # O(n) on the first call; Ruby caches the result in the
    # string's coderange, so repeated calls are cheap
    10_000.times { text.valid_encoding? }
  end
end
```
Bulk processing benefits from encoding normalization strategies that minimize conversion overhead. Converting all inputs to a common encoding once, rather than converting repeatedly during processing, reduces total computational cost.
```ruby
# Inefficient: convert during each operation
def process_files_inefficient(file_paths)
  results = []
  file_paths.each do |path|
    content = File.read(path, encoding: 'ASCII-8BIT')  # Read as binary
    # From binary, high bytes count as undefined conversions,
    # so :undef must be handled alongside :invalid
    utf8_content = content.encode('UTF-8', invalid: :replace, undef: :replace)
    processed = utf8_content.upcase.gsub(/\s+/, ' ')  # Multiple operations
    results << processed.encode('ASCII', invalid: :replace, undef: :replace)
  end
  results
end

# Efficient: batch convert once
def process_files_efficient(file_paths)
  # Read and normalize all files first
  normalized_contents = file_paths.map do |path|
    content = File.read(path, encoding: 'ASCII-8BIT')
    content.encode('UTF-8', invalid: :replace, undef: :replace)
  end

  # Process in a consistent encoding
  results = normalized_contents.map do |content|
    content.upcase.gsub(/\s+/, ' ')
  end

  # Convert outputs once if needed
  results.map { |r| r.encode('ASCII', invalid: :replace, undef: :replace) }
end
```
Common Pitfalls
Encoding assumption errors represent the most frequent category of string encoding problems. Developers often assume UTF-8 encoding for external data, but file systems, databases, and network protocols may use different encodings without explicit indication.
```ruby
# Dangerous assumption - the file might not be UTF-8
def read_user_file_wrong(path)
  content = File.read(path)  # Assumes the default external encoding (UTF-8)
  content.upcase  # Later operations may raise on invalid UTF-8
end

# Safe approach with encoding detection/handling
def read_user_file_safe(path)
  # Read as binary first to inspect the bytes
  binary_content = File.read(path, encoding: 'ASCII-8BIT')

  # Try UTF-8 first
  if binary_content.force_encoding('UTF-8').valid_encoding?
    return binary_content  # Already relabeled as UTF-8 above
  end

  # Fall back to Latin-1, which can represent any byte
  binary_content.force_encoding('ISO-8859-1').encode('UTF-8')
end
```
Confusing `force_encoding` with `encode` leads to subtle bugs where strings appear correct but contain invalid byte sequences. `force_encoding` changes the interpretation without validating compatibility, while `encode` performs actual conversion.
```ruby
# Common mistake: using force_encoding for conversion
latin1_bytes = "café".encode('ISO-8859-1')

# Wrong way - just changes the label, doesn't convert bytes
# (dup so the mutation doesn't relabel latin1_bytes itself)
wrong_utf8 = latin1_bytes.dup.force_encoding('UTF-8')
wrong_utf8.valid_encoding?  # => false! (0xE9 alone is invalid UTF-8)

# Correct way - actually converts the byte sequence
right_utf8 = latin1_bytes.encode('UTF-8')
right_utf8.valid_encoding?  # => true
```
Regular expression encoding compatibility creates unexpected match failures. Patterns compiled with one encoding may fail to match strings with different encodings, even when the content appears identical.
```ruby
# A pattern containing only ASCII characters gets US-ASCII encoding
ascii_pattern = /caf/
ascii_pattern.encoding  # => #<Encoding:US-ASCII>

utf8_text = "I love café food"
utf8_text.encoding  # => #<Encoding:UTF-8>

# This works due to ASCII compatibility promotion
utf8_text =~ ascii_pattern  # => 7

# A pattern with non-ASCII bytes in another encoding is incompatible
latin1_pattern = Regexp.new("caf\xE9".force_encoding('ISO-8859-1'))
utf8_text =~ latin1_pattern  # => raises Encoding::CompatibilityError

# Solution: build the pattern from a string in the text's encoding
compatible_pattern = Regexp.new('café'.encode(utf8_text.encoding))
utf8_text =~ compatible_pattern  # => 7
```
JSON and XML parsing libraries exhibit inconsistent encoding behavior. Some libraries respect source encoding, others force UTF-8, and many provide configuration options that change default behavior between versions.
```ruby
require 'json'

# The JSON standard requires UTF-8, but input might arrive differently
latin1_json = '{"name": "café"}'.encode('ISO-8859-1')

begin
  # Behavior with non-UTF-8 input varies by json library version
  parsed = JSON.parse(latin1_json)
rescue JSON::ParserError, EncodingError
  # Convert to UTF-8 first, then retry
  parsed = JSON.parse(latin1_json.encode('UTF-8'))
end
```
Database encoding mismatches cause data corruption that persists across application restarts. Character data stored with incorrect encoding interpretation becomes permanently corrupted unless detected and repaired quickly.
```ruby
# Simulate a database encoding mismatch
class DatabaseSimulator
  def initialize
    @storage = []
  end

  # Wrong: application sends Latin-1 bytes, database stores them raw
  def store_wrong(text)
    stored_bytes = text.encode('ISO-8859-1')
    @storage << stored_bytes
    stored_bytes
  end

  # Retrieval compounds the problem
  def retrieve_wrong(index)
    stored = @storage[index]
    # Application assumes the stored bytes are UTF-8
    stored.force_encoding('UTF-8')  # Invalid UTF-8!
  end

  # Correct approach - consistent encoding throughout
  def store_correct(text)
    utf8_text = text.encode('UTF-8')  # Normalize input
    @storage << utf8_text
    utf8_text
  end

  def retrieve_correct(index)
    @storage[index]  # Already UTF-8
  end
end
```
```ruby
# Demonstration of corruption
db = DatabaseSimulator.new
original = "café"

# Wrong way corrupts data
db.store_wrong(original)
retrieved_wrong = db.retrieve_wrong(0)
retrieved_wrong.valid_encoding?  # => false
retrieved_wrong == original      # => false

# Correct way preserves data
db.store_correct(original)
retrieved_correct = db.retrieve_correct(1)
retrieved_correct.valid_encoding?  # => true
retrieved_correct == original      # => true
```
Reference
Core Encoding Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
| `String#encoding` | none | `Encoding` | Returns the encoding object for the string |
| `String#force_encoding(encoding)` | `encoding` (String/Encoding) | `String` | Changes encoding label without converting bytes |
| `String#encode(encoding, **opts)` | `encoding` (String/Encoding), options (Hash) | `String` | Converts string to specified encoding |
| `String#encode!(encoding, **opts)` | `encoding` (String/Encoding), options (Hash) | `String` | In-place encoding conversion |
| `String#valid_encoding?` | none | Boolean | Tests if string contains valid byte sequences |
| `String#ascii_only?` | none | Boolean | Tests if string contains only ASCII characters |
| `String#scrub(replace = nil)` | `replace` (String), optional block | `String` | Removes or replaces invalid byte sequences |
Encoding Information Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
| `String#bytesize` | none | Integer | Returns byte count of the string |
| `String#size` / `String#length` | none | Integer | Returns character count of the string |
| `String#bytes` | none | Array\<Integer> | Returns array of byte values |
| `String#codepoints` | none | Array\<Integer> | Returns array of Unicode code points |
| `String#each_byte` | block | Enumerator/String | Iterates over each byte |
| `String#each_codepoint` | block | Enumerator/String | Iterates over each code point |
Encoding Class Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
| `Encoding.list` | none | Array\<Encoding> | Returns all available encodings |
| `Encoding.find(name)` | `name` (String) | `Encoding` | Finds encoding by name or alias |
| `Encoding.default_external` | none | `Encoding` | Returns default external encoding |
| `Encoding.default_internal` | none | `Encoding`/nil | Returns default internal encoding |
| `Encoding.compatible?(obj1, obj2)` | two objects | `Encoding`/nil | Returns the compatible encoding, or nil |
Conversion Options
| Option | Values | Description |
|---|---|---|
| `:invalid` | `nil` (raise), `:replace` | How to handle invalid byte sequences |
| `:undef` | `nil` (raise), `:replace` | How to handle undefined character conversions |
| `:replace` | String | Replacement string for invalid/undefined characters |
| `:fallback` | Hash/Proc/Method | Character-specific replacement mappings |
| `:xml` | `:text`, `:attr` | XML-safe replacement mode |
| `:cr_newline` | true | Convert LF to CR during transcoding |
| `:crlf_newline` | true | Convert LF to CRLF during transcoding |
| `:universal_newline` | true | Convert CRLF and CR to LF during transcoding |
Common Encoding Names
| Encoding | Aliases | Byte Width | Description |
|---|---|---|---|
| UTF-8 | CP65001 | 1-4 bytes | Unicode, web standard |
| UTF-16 | (related: UTF-16LE, UTF-16BE) | 2-4 bytes | Unicode, common on Windows |
| US-ASCII | ASCII | 1 byte | 7-bit ASCII characters only |
| ASCII-8BIT | BINARY | 1 byte | Binary data, no character interpretation |
| ISO-8859-1 | Latin-1 | 1 byte | Western European languages |
| Windows-1252 | CP1252 | 1 byte | Windows Western European |
| Shift_JIS | SJIS | 1-2 bytes | Japanese character encoding |
Exception Hierarchy
```
EncodingError
├── Encoding::UndefinedConversionError
├── Encoding::InvalidByteSequenceError
├── Encoding::ConverterNotFoundError
└── Encoding::CompatibilityError
```
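Because every conversion error inherits from `EncodingError`, one rescue clause can back off to lossy replacement; `safe_transcode` below is a hypothetical helper sketched for illustration, not part of Ruby:

```ruby
# Catch any encoding exception via the common EncodingError ancestor
def safe_transcode(str, target)
  str.encode(target)
rescue EncodingError
  str.encode(target, invalid: :replace, undef: :replace, replace: '?')
end

safe_transcode("café\xFF", 'US-ASCII')  # => "caf??"
```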
Encoding Detection Patterns
| Byte Sequence | Encoding | Location |
|---|---|---|
| `EF BB BF` | UTF-8 | BOM at start |
| `FF FE` | UTF-16LE | BOM at start |
| `FE FF` | UTF-16BE | BOM at start |
| `00 xx 00 xx` | UTF-16BE | Pattern in text |
| `xx 00 xx 00` | UTF-16LE | Pattern in text |
| All bytes < `0x80` | ASCII | Throughout |
| Valid UTF-8 sequences | UTF-8 | Heuristic analysis |
Performance Characteristics
| Operation | Time Complexity | Notes |
|---|---|---|
| `String#bytesize` | O(1) | Cached metadata |
| `String#size` | O(n) | Must count characters (O(1) for ASCII-only or fixed-width encodings) |
| `String#valid_encoding?` | O(n) | Full scan on first call; result cached in the coderange |
| `String#force_encoding` | O(1) | Metadata change only |
| `String#encode` | O(n) | Byte transformation required |
| `String#scrub` | O(n) | Scans and potentially copies |