Overview
Ruby's encoding system centers on the Encoding class and encoding-aware string operations. The Encoding class represents character encoding schemes like UTF-8, ASCII, and ISO-8859-1, while strings maintain encoding metadata that determines how byte sequences are interpreted as characters.
The system provides three primary capabilities: encoding identification, character conversion, and encoding validation. Every string in Ruby has an associated encoding that affects string operations, regular expression matching, and I/O operations.
# Encoding objects represent encoding schemes
utf8 = Encoding::UTF_8
ascii = Encoding::ASCII
# Strings carry encoding metadata
str = "Hello"
str.encoding # => #<Encoding:UTF-8>
# Encoding affects character interpretation
bytes = "\xC3\xA9" # UTF-8 bytes for "é"
bytes.force_encoding("UTF-8") # => "é"
bytes.force_encoding("ASCII") # => "\xC3\xA9"
Ruby maintains a default external encoding for I/O operations and a default internal encoding for string processing. The external encoding determines how Ruby interprets bytes from files and network connections, while the internal encoding controls automatic transcoding during I/O operations.
# Default encodings affect I/O behavior
Encoding.default_external # => #<Encoding:UTF-8>
Encoding.default_internal # => nil
# File operations use default external encoding
File.write("test.txt", "content")
File.read("test.txt").encoding # => #<Encoding:UTF-8>
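The difference between tagging bytes with an external encoding and transcoding them to an internal encoding can be sketched with a temporary file. This is a minimal illustration; the file name and sample bytes are assumptions made for the example.

```ruby
require "tmpdir"

tagged = transcoded = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "latin1.txt") # hypothetical file name
  # Write "caf" plus the single Latin-1 byte for "é" (0xE9), untouched
  File.binwrite(path, "caf\xE9".b)

  # External encoding only: bytes are tagged as ISO-8859-1, not converted
  tagged = File.read(path, encoding: "ISO-8859-1")
  # "external:internal" pair: bytes are transcoded to UTF-8 while reading
  transcoded = File.read(path, encoding: "ISO-8859-1:UTF-8")
end

puts tagged.encoding       # ISO-8859-1
puts tagged.bytes.inspect  # [99, 97, 102, 233]
puts transcoded.encoding   # UTF-8
puts transcoded            # café
```

Passing the pair per call keeps the conversion local to one read instead of changing the process-wide defaults.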
The encoding system integrates with Ruby's string methods, regular expressions, and I/O operations. String concatenation, pattern matching, and file operations all respect encoding boundaries and perform automatic validation where appropriate.
Basic Usage
Encoding objects are accessed through constants in the Encoding class or looked up by encoding name. Each encoding object contains metadata about the character encoding scheme, including its name, aliases, and properties.
# Access encodings by constant
utf8 = Encoding::UTF_8
latin1 = Encoding::ISO_8859_1
ascii = Encoding::US_ASCII
# Access encodings by name
utf8 = Encoding.find("UTF-8")
shift_jis = Encoding.find("Shift_JIS")
# List available encodings
Encoding.list.size # => 100+
Encoding.name_list.first(5) # => ["ASCII-8BIT", "UTF-8", "US-ASCII", ...]
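Name lookup is case-insensitive and accepts aliases, and an unknown name raises ArgumentError. The sketch below illustrates this; `find_encoding_or_nil` is a hypothetical helper introduced only for the example.

```ruby
# Lookup is case-insensitive and accepts aliases
puts Encoding.find("utf-8") == Encoding::UTF_8          # true
puts Encoding.find("ISO8859-1") == Encoding::ISO_8859_1 # true
puts Encoding.find("binary") == Encoding::ASCII_8BIT    # true

# Each encoding reports its canonical name and its aliases
puts Encoding::ISO_8859_1.name                          # ISO-8859-1
puts Encoding::ISO_8859_1.names.include?("ISO8859-1")   # true

# Hypothetical helper: return nil instead of raising for unknown names
def find_encoding_or_nil(name)
  Encoding.find(name)
rescue ArgumentError
  nil
end

puts find_encoding_or_nil("no-such-encoding").inspect   # nil
```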
String encoding is controlled through several methods that either change encoding metadata or perform actual character conversion. The force_encoding method changes encoding interpretation without modifying bytes, while encode performs character-by-character conversion.
# Change encoding interpretation (no byte modification)
data = "\x48\x65\x6C\x6C\x6F" # ASCII bytes for "Hello"
data.encoding # => #<Encoding:UTF-8>
ascii_str = data.force_encoding("ASCII")
ascii_str.encoding # => #<Encoding:US-ASCII>
# Convert between encodings (modifies bytes)
utf8_str = "Héllo" # UTF-8 string
latin1_str = utf8_str.encode("ISO-8859-1")
latin1_str.bytes # => [72, 233, 108, 108, 111]
utf8_str.bytes # => [72, 195, 169, 108, 108, 111]
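One detail worth keeping in mind is that force_encoding retags the receiver in place and returns that same object, whereas encode returns a new string. A short sketch, with variable names chosen only for illustration:

```ruby
original = "caf\u00E9" # UTF-8, bytes 63 61 66 C3 A9

# force_encoding mutates its receiver and returns it
copy = original.dup
retagged = copy.force_encoding("ISO-8859-1")
puts retagged.equal?(copy)   # true: one object carrying a new label

# encode returns a new string with converted bytes; the receiver is unchanged
latin1 = original.encode("ISO-8859-1")
puts original.encoding       # UTF-8
puts latin1.bytes.inspect    # [99, 97, 102, 233]

# String#b returns a binary copy without touching the receiver
raw = original.b
puts raw.bytesize            # 5
puts original.encoding       # UTF-8
```

Calling dup before force_encoding is a simple way to avoid surprising callers that still hold a reference to the original string.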
Encoding validation ensures strings contain valid byte sequences for their declared encoding. The valid_encoding? method checks validity without raising exceptions, while converting an invalid string with encode raises Encoding::InvalidByteSequenceError.
# Validate encoding correctness
valid_utf8 = "Hello 世界"
valid_utf8.valid_encoding? # => true
# Invalid byte sequences
invalid = "\xFF\xFE".force_encoding("UTF-8")
invalid.valid_encoding? # => false
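# Check encoding compatibility (call ascii_compatible? on Encoding objects;
# a literal such as "UTF-16" is itself a UTF-8 string)
Encoding::US_ASCII.ascii_compatible? # => true
Encoding::UTF_8.ascii_compatible? # => true
Encoding::UTF_16.ascii_compatible? # => false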
String operations between different encodings follow specific compatibility rules. ASCII-compatible encodings can be combined with ASCII-only strings, while incompatible encodings require explicit conversion or raise Encoding::CompatibilityError.
# Compatible encoding operations
ascii_str = "Hello".encode("US-ASCII") # US-ASCII (a plain literal would be UTF-8)
utf8_str = " 世界" # UTF-8
combined = ascii_str + utf8_str # => "Hello 世界" (UTF-8)
# Incompatible encodings
utf16_str = "Hello".encode("UTF-16")
utf8_str = "World"
# utf16_str + utf8_str # => Encoding::CompatibilityError
# Explicit conversion resolves incompatibility
combined = utf16_str.encode("UTF-8") + utf8_str
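Compatibility can also be checked before combining strings. Encoding.compatible? returns the encoding the result would have, or nil when the operands cannot be combined. The sketch below assumes a hypothetical helper, `safe_concat`, that falls back to transcoding both sides to UTF-8.

```ruby
ascii  = "Hello".encode("US-ASCII")
utf8   = "\u4E16\u754C"                    # "世界"
latin1 = "caf\u00E9".encode("ISO-8859-1")
utf16  = "Hello".encode("UTF-16LE")

p Encoding.compatible?(ascii, utf8)   # #<Encoding:UTF-8>
p Encoding.compatible?(latin1, utf8)  # nil (both contain non-ASCII characters)
p Encoding.compatible?(utf16, utf8)   # nil (UTF-16LE is not ASCII-compatible)

# Hypothetical helper: concatenate directly when possible,
# otherwise transcode both operands to UTF-8 first
def safe_concat(a, b)
  return a + b if Encoding.compatible?(a, b)
  a.encode("UTF-8") + b.encode("UTF-8")
end

puts safe_concat(latin1, utf8)  # café世界
puts safe_concat(utf16, utf8)   # Hello世界
```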
Error Handling & Debugging
Encoding operations raise specific exception types for different error conditions. Encoding::InvalidByteSequenceError occurs when the input contains bytes that are invalid in the source encoding, while Encoding::UndefinedConversionError indicates characters that are valid in the source encoding but cannot be represented in the target encoding.
# Handle invalid byte sequences
begin
  invalid_bytes = "\xFF\xFE"
  invalid_bytes.force_encoding("UTF-8")
  invalid_bytes.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.inspect}"
  puts "Source encoding: #{e.source_encoding}"
  puts "Destination encoding: #{e.destination_encoding}"
end
# Handle undefined character conversions
begin
  unicode_str = "Hello 🚀 World"
  unicode_str.encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Undefined character: #{e.error_char.inspect}"
  puts "Cannot convert to: #{e.destination_encoding}"
end
Encoding conversion supports fallback strategies for handling conversion errors. The :invalid and :undef options control how each kind of error is handled, while the :replace option specifies the replacement string.
# Replace invalid sequences
invalid_utf8 = "Hello \xFF World".force_encoding("UTF-8")
cleaned = invalid_utf8.encode("UTF-8",
                              invalid: :replace,
                              replace: "?")
# => "Hello ? World"
# Replace undefined characters
unicode_text = "Café 🚀"
ascii_safe = unicode_text.encode("ASCII",
                                 undef: :replace,
                                 replace: "?")
# => "Caf? ?"
# Drop problematic characters (the space between the words is kept)
cleaned = unicode_text.encode("ASCII",
                              undef: :replace,
                              replace: "")
# => "Caf "
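When a single replacement string is too coarse, the :fallback option of String#encode can map each undefined character individually, using either a Hash or a callable. The mapping table and output format here are illustrative assumptions, not a recommended transliteration scheme.

```ruby
text = "Caf\u00E9 \u{1F680}" # "Café 🚀"

# Hash fallback: look up each undefined character
# (a character missing from the hash would still raise)
TRANSLIT = { "\u00E9" => "e", "\u{1F680}" => "[rocket]" }.freeze
ascii = text.encode("US-ASCII", fallback: TRANSLIT)
puts ascii   # Cafe [rocket]

# Callable fallback: compute a replacement from the character itself
as_codepoint = ->(char) { format("<U+%04X>", char.ord) }
tagged = text.encode("US-ASCII", fallback: as_codepoint)
puts tagged  # Caf<U+00E9> <U+1F680>
```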
Debugging encoding issues requires examining byte-level data and encoding metadata. The bytes, codepoints, and chars methods provide different views of string data for diagnosis.
# Analyze problematic strings
problem_str = "\u00E9\u0301" # "é" with combining accent
puts "String: #{problem_str.inspect}"
puts "Encoding: #{problem_str.encoding}"
puts "Bytes: #{problem_str.bytes}"
puts "Codepoints: #{problem_str.codepoints.map { |c| "U+#{c.to_s(16).upcase}" }}"
puts "Valid: #{problem_str.valid_encoding?}"
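# Compare different encoding interpretations (dup keeps each view independent)
data = "\xC3\xA9" # UTF-8 bytes for "é"
puts "As UTF-8: #{data.dup.force_encoding('UTF-8').inspect}"
# Transcode the Latin-1 view to UTF-8 so the mojibake ("Ã©") is visible
puts "As Latin-1: #{data.dup.force_encoding('ISO-8859-1').encode('UTF-8').inspect}"
puts "As ASCII: #{data.dup.force_encoding('ASCII').inspect}"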
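These views can be bundled into one diagnostic summary. The sketch below uses a hypothetical helper, `describe_string`, and guards the codepoints call because it raises on invalid data.

```ruby
# Hypothetical helper: collect the byte-level and character-level views
def describe_string(str)
  valid = str.valid_encoding?
  {
    encoding: str.encoding.name,
    valid: valid,
    hex_bytes: str.bytes.map { |b| format("%02X", b) }.join(" "),
    # codepoints raises on invalid byte sequences, so only call it when valid
    codepoints: valid ? str.codepoints.map { |c| format("U+%04X", c) } : nil
  }
end

composed   = "\u00E9"   # single precomposed codepoint
decomposed = "e\u0301"  # "e" followed by a combining acute accent
truncated  = "\xC3"     # first byte of a two-byte UTF-8 sequence

p describe_string(composed)
p describe_string(decomposed)
p describe_string(truncated)
```

Comparing the composed and decomposed forms shows why two strings that render identically can still compare as unequal.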
Encoding validation can be performed incrementally for large data processing. The Encoding::Converter class provides fine-grained control over conversion processes and error handling.
# Incremental encoding validation and conversion
converter = Encoding::Converter.new("UTF-8", "ASCII",
                                    invalid: :replace,
                                    undef: :replace,
                                    replace: "?")
# Process data in chunks
input = "Café 🚀 World".dup # primitive_convert consumes the source buffer
output = ""
status = nil
begin
  # Arguments: source, destination, destination byte offset (nil = append),
  # and the maximum number of bytes to write in this call
  status = converter.primitive_convert(input, output, nil, 5)
  puts "Status: #{status}, Output so far: #{output.inspect}"
end while status == :destination_buffer_full
puts "Final output: #{output}"
puts "Conversion finished: #{status == :finished}"
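For input that arrives in pieces, Encoding::Converter#convert treats each call as partial input and buffers an incomplete trailing character until the next chunk, with finish flushing whatever remains. A minimal sketch, where the 3-byte chunk size is an arbitrary assumption chosen so that one chunk ends in the middle of a character:

```ruby
source = "H\u00E9llo w\u00F6rld" # "Héllo wörld"
# Split the UTF-8 bytes into 3-byte chunks; the third chunk ends mid-character
chunks = source.bytes.each_slice(3).map { |slice| slice.pack("C*") }

converter = Encoding::Converter.new("UTF-8", "ISO-8859-1")
output = String.new(encoding: "ISO-8859-1")
# convert buffers an incomplete trailing sequence until more bytes arrive
chunks.each { |chunk| output << converter.convert(chunk) }
# finish flushes any remaining state and must be called last
output << converter.finish

puts chunks.length          # 5
puts output.encoding        # ISO-8859-1
puts output.encode("UTF-8") # Héllo wörld
```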
Production Patterns
Web applications frequently encounter encoding issues when processing user input, file uploads, and API responses. Establishing consistent encoding handling patterns prevents data corruption and application errors.
# Sanitize user input encoding
def sanitize_user_input(input)
  return "" if input.nil?
  # Work on a copy so the caller's (possibly frozen) string is not mutated
  sanitized = input.to_s.dup.force_encoding("UTF-8")
  unless sanitized.valid_encoding?
    # Replace invalid sequences with U+FFFD
    sanitized = sanitized.encode("UTF-8",
                                 invalid: :replace,
                                 undef: :replace,
                                 replace: "�")
  end
  sanitized
end
# Process form data
form_data = params[:message] # Potentially mixed encoding
clean_message = sanitize_user_input(form_data)
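Since Ruby 2.1, String#scrub offers a more direct way to replace invalid byte sequences without naming a target encoding. The following sketch shows its three forms; the sample string is made up for illustration.

```ruby
dirty = "abc\xFFdef" # UTF-8 string containing one invalid byte

puts dirty.valid_encoding?   # false

# Default replacement is U+FFFD for Unicode encodings
default_clean = dirty.scrub
# An explicit replacement string
question_clean = dirty.scrub("?")
# A block receives the invalid bytes and returns their replacement
hex_clean = dirty.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }

puts question_clean               # abc?def
puts hex_clean                    # abc<ff>def
puts default_clean.valid_encoding? # true
```

scrub returns a new string, so the original data stays available for logging or later inspection.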
File processing operations require explicit encoding handling to prevent interpretation errors. Different file sources may use various encodings, requiring detection and normalization strategies.
# Robust file reading with encoding detection
def read_file_with_encoding(filepath)
  # Read the raw bytes once, then try candidate encodings in order.
  # ISO-8859-1 accepts every byte sequence, so it must come last.
  raw = File.binread(filepath)
  encodings = ["UTF-8", "Windows-1252", "ISO-8859-1"]
  encodings.each do |encoding|
    candidate = raw.dup.force_encoding(encoding)
    next unless candidate.valid_encoding?
    begin
      # Normalize every successful candidate to UTF-8
      return candidate.encode("UTF-8")
    rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
      next
    end
  end
  # Defensive fallback: treat as UTF-8 and replace invalid sequences
  raw.force_encoding("UTF-8")
     .encode("UTF-8", invalid: :replace, replace: "�")
end
# CSV processing with encoding handling
require 'csv'
def process_csv_file(filepath)
  content = read_file_with_encoding(filepath)
  CSV.parse(content, headers: true).map do |row|
    # Return each row as a hash whose values are valid UTF-8
    row.to_h.transform_values { |field| sanitize_user_input(field) }
  end
end
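Files exported from spreadsheets often begin with a UTF-8 byte order mark, which otherwise ends up glued to the first header name. Ruby can strip it while reading through the "BOM|UTF-8" encoding specifier. A small sketch, with the file name and contents assumed for the example:

```ruby
require "tmpdir"

plain = stripped = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "export.csv") # hypothetical file name
  # Write a UTF-8 BOM (EF BB BF) followed by a CSV header line
  File.binwrite(path, "\xEF\xBB\xBFname,city\n".b)

  plain    = File.read(path, encoding: "UTF-8")     # BOM is kept
  stripped = File.read(path, encoding: "BOM|UTF-8") # BOM is removed
end

puts plain.start_with?("\uFEFF")  # true
puts stripped.start_with?("name") # true
```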
API integration requires careful encoding handling for JSON responses and request data. Different APIs may return content in various encodings, requiring normalization before processing.
# HTTP client with encoding handling
require 'net/http'
require 'json'
def fetch_api_data(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)
  body = response.body.to_s.dup
  # Use the charset parameter of the Content-Type header when present;
  # default to UTF-8 for JSON APIs
  charset = response.type_params['charset']&.delete('"') || "UTF-8"
  begin
    body.force_encoding(charset)
  rescue ArgumentError
    # Unknown charset name: fall back to UTF-8
    body.force_encoding("UTF-8")
  end
  # Transcode to UTF-8, replacing anything that cannot be converted
  body = body.encode("UTF-8", invalid: :replace, undef: :replace, replace: "�")
  JSON.parse(body)
rescue Encoding::InvalidByteSequenceError, JSON::ParserError => e
  Rails.logger.error "Error processing API response: #{e.message}"
  nil
end
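When only the raw header value and the body bytes are available, the charset handling can be isolated into small functions that are testable without a network call. Both `charset_from_content_type` and `decode_body` are hypothetical helpers sketched for illustration.

```ruby
# Hypothetical helper: extract the charset parameter as an Encoding, or nil
def charset_from_content_type(header)
  return nil if header.nil?
  match = header.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  return nil unless match
  Encoding.find(match[1])
rescue ArgumentError
  nil # unknown encoding name
end

# Hypothetical helper: tag raw bytes with the declared charset
# (UTF-8 when absent or unknown), then normalize to UTF-8
def decode_body(raw, header)
  encoding = charset_from_content_type(header) || Encoding::UTF_8
  raw.dup.force_encoding(encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "\uFFFD")
end

puts decode_body("caf\xE9".b, "text/plain; charset=ISO-8859-1") # café
puts charset_from_content_type("application/json").inspect      # nil
```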
Database operations benefit from explicit encoding configuration to ensure consistent data storage and retrieval. Most databases default to UTF-8, but legacy systems may use different encodings.
# Database encoding configuration
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  # Ensure UTF-8 encoding for text fields
  before_save :normalize_text_encodings

  private

  def normalize_text_encodings
    self.class.columns.each do |column|
      if column.type == :text || column.type == :string
        value = read_attribute(column.name)
        if value.is_a?(String)
          normalized = sanitize_user_input(value)
          write_attribute(column.name, normalized)
        end
      end
    end
  end
end
# Migration for encoding issues
class FixEncodingIssues < ActiveRecord::Migration[7.0]
  def up
    # Fix existing data with encoding problems
    User.find_each do |user|
      if user.name && !user.name.valid_encoding?
        user.update_column(:name,
                           user.name.encode("UTF-8",
                                            invalid: :replace,
                                            replace: "�"))
      end
    end
  end
end
Reference
Core Classes and Methods
Class/Method | Parameters | Returns | Description |
---|---|---|---|
Encoding.list | none | Array<Encoding> | All available encoding objects |
Encoding.find(name) | name (String) | Encoding | Encoding object by name |
Encoding.default_external | none | Encoding | Default encoding for I/O operations |
Encoding.default_internal | none | Encoding or nil | Default internal encoding |
String#encoding | none | Encoding | String's current encoding |
String#force_encoding(encoding) | encoding (String/Encoding) | String | Change encoding without conversion |
String#encode(encoding, **opts) | encoding, options (Hash) | String | Convert to different encoding |
String#encode!(encoding, **opts) | encoding, options (Hash) | String | In-place encoding conversion |
String#valid_encoding? | none | Boolean | Check if bytes are valid for encoding |
String#ascii_only? | none | Boolean | Check if string contains only ASCII characters |
String Encoding Methods
Method | Parameters | Returns | Description |
---|---|---|---|
String#b | none | String | Return binary (ASCII-8BIT) copy |
String#bytes | none | Array<Integer> | Array of byte values |
String#chars | none | Array<String> | Array of characters |
String#codepoints | none | Array<Integer> | Array of Unicode codepoints |
String#each_byte | block | Enumerator or String | Iterate over bytes |
String#each_char | block | Enumerator or String | Iterate over characters |
String#each_codepoint | block | Enumerator or String | Iterate over codepoints |
Encoding Objects
Method | Parameters | Returns | Description |
---|---|---|---|
Encoding#name | none | String | Canonical encoding name |
Encoding#names | none | Array<String> | All names and aliases |
Encoding#ascii_compatible? | none | Boolean | Whether encoding is ASCII-compatible |
Encoding#dummy? | none | Boolean | Whether encoding is a dummy encoding |
Encoding#inspect | none | String | Human-readable representation |
Encoding Converter
Method | Parameters | Returns | Description |
---|---|---|---|
Encoding::Converter.new(src, dst, **opts) | source, destination, options | Converter | Create new converter |
Converter#convert(string) | string (String) | String | Convert a chunk of input (call finish last) |
Converter#primitive_convert(src, dst, dst_byteoffset, dst_bytesize) | source buffer, destination buffer, destination byte offset (nil = append), destination byte limit | Symbol | Incremental conversion |
Converter#finish | none | String | Finish conversion and return remaining output |
Converter#source_encoding | none | Encoding | Source encoding |
Converter#destination_encoding | none | Encoding | Destination encoding |
Common Encodings
Constant | Name | Description |
---|---|---|
Encoding::ASCII_8BIT | "ASCII-8BIT" | Binary data, no character interpretation |
Encoding::US_ASCII | "US-ASCII" | 7-bit ASCII character set |
Encoding::UTF_8 | "UTF-8" | Unicode UTF-8 encoding |
Encoding::UTF_16 | "UTF-16" | Unicode UTF-16 encoding |
Encoding::UTF_32 | "UTF-32" | Unicode UTF-32 encoding |
Encoding::ISO_8859_1 | "ISO-8859-1" | Latin-1 character set |
Encoding::Windows_1252 | "Windows-1252" | Windows Latin character set |
Encoding::Shift_JIS | "Shift_JIS" | Japanese character encoding |
Conversion Options
Option | Values | Description |
---|---|---|
:invalid | :replace | Replace invalid byte sequences |
:undef | :replace | Replace undefined characters |
:replace | String | Replacement character/string |
:fallback | Hash/Proc | Custom character fallbacks |
:xml | :text, :attr | XML entity replacement mode |
:cr_newline | Boolean | Convert LF to CR |
:crlf_newline | Boolean | Convert LF to CRLF |
:universal_newline | Boolean | Convert CRLF and CR to LF |
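The newline options can be combined with a conversion or used on their own with the string's current encoding. A brief sketch of their effect, using made-up sample strings:

```ruby
unix  = "line1\nline2\n"
mixed = "a\r\nb\rc\n"

crlf      = unix.encode("UTF-8", crlf_newline: true)       # LF -> CRLF
cr        = unix.encode("UTF-8", cr_newline: true)         # LF -> CR
universal = mixed.encode("UTF-8", universal_newline: true) # CRLF and CR -> LF

puts crlf.inspect      # "line1\r\nline2\r\n"
puts cr.inspect        # "line1\rline2\r"
puts universal.inspect # "a\nb\nc\n"
```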
Exception Hierarchy
Exception | Description |
---|---|
Encoding::CompatibilityError | Incompatible encodings in operation |
Encoding::InvalidByteSequenceError | Invalid bytes in source encoding |
Encoding::UndefinedConversionError | Character undefined in target encoding |
Encoding::ConverterNotFoundError | No converter available for encoding pair |
Error Information Methods
Method | Available On | Returns | Description |
---|---|---|---|
#source_encoding | InvalidByteSequenceError, UndefinedConversionError | Encoding | Source encoding |
#destination_encoding | InvalidByteSequenceError, UndefinedConversionError | Encoding | Target encoding |
#error_bytes | InvalidByteSequenceError | String | Invalid byte sequence |
#error_char | UndefinedConversionError | String | Undefined character |
#readagain_bytes | InvalidByteSequenceError | String | Additional problematic bytes |