CrackedRuby

Encoding Conversion

Convert text between character encodings using Ruby's built-in encoding system and String methods.

Overview

Ruby handles text encoding through the Encoding class and String methods that convert character data between different encoding formats. The encoding system operates on byte sequences, transforming them from one character representation to another while preserving the textual meaning.

String objects in Ruby carry encoding information internally. The String#encoding method returns the current encoding, while String#encode performs the actual conversion. Ruby supports over 100 encodings, including UTF-8, ASCII, ISO-8859-1, Windows-1252, and various Asian encodings like Shift_JIS and EUC-JP.

text = "Hello, 世界"
text.encoding
# => #<Encoding:UTF-8>

text.encode('ISO-8859-1', undef: :replace)
# => "Hello, ??"  (one "?" per character that Latin-1 cannot represent)

The encoding conversion process involves three key components: the source encoding (current string encoding), the target encoding (desired output encoding), and conversion options that control error handling and replacement characters.

# Check available encodings
Encoding.list.first(5)
# => [#<Encoding:ASCII-8BIT>, #<Encoding:UTF-8>, #<Encoding:US-ASCII>, #<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>]

# Get encoding by name
Encoding.find('UTF-8')
# => #<Encoding:UTF-8>
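
The three components named above (source encoding, target encoding, options) can all be passed explicitly. The sketch below assumes some bytes that are known to be Latin-1 but arrive tagged as binary. The two-argument form of String#encode takes the target first and the source second, so the receiver's current label is ignored. The variable names are illustrative only.

```ruby
# Bytes for "café" in ISO-8859-1, tagged as binary (illustrative input).
latin1_bytes = "caf\xE9".b

# Target encoding first, source encoding second, options last.
utf8_text = latin1_bytes.encode('UTF-8', 'ISO-8859-1', undef: :replace)

puts utf8_text                # prints: café
puts utf8_text.encoding       # prints: UTF-8
puts utf8_text.bytes.inspect  # prints: [99, 97, 102, 195, 169]
```

Passing the source explicitly avoids a separate force_encoding step and leaves the original string untouched.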

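Ruby's encoding system also distinguishes between compatible and incompatible encodings. Two strings are compatible when Ruby can combine them (concatenate, compare, match) without transcoding, for example when one of them is ASCII-only and both encodings are ASCII-compatible. Converting between encodings with different character repertoires is a separate concern and may result in data loss or require a replacement strategy.

ascii_text = "Hello".encode('US-ASCII')
utf8_text = "Hello, 世界"
Encoding.compatible?(ascii_text, utf8_text)
# => #<Encoding:UTF-8>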

Basic Usage

The primary method for encoding conversion is String#encode, which accepts a target encoding and optional parameters. The method returns a new string with the specified encoding, leaving the original string unchanged.

original = "Café"
converted = original.encode('ASCII', undef: :replace, replace: '?')
# => "Caf?"

original.encoding
# => #<Encoding:UTF-8>
converted.encoding  
# => #<Encoding:US-ASCII>

Ruby accepts encoding specifications as encoding names (strings, matched case-insensitively, including aliases such as 'ascii' for US-ASCII) or as Encoding objects. What the conversion produces depends on whether each character exists in the target encoding.

text = "München"

# String notation
text.encode('ISO-8859-1')
# => "M\xFCnchen"  (inspect escapes the Latin-1 byte for ü on a UTF-8 console)

# Case-insensitive alias
text.encode('ascii', undef: :replace)
# => "M?nchen"

# Encoding object
text.encode(Encoding::UTF_16)
# => "\uFEFFM\u00FCnchen"  (big-endian, with a leading byte order mark)

The force_encoding method changes the encoding label without converting bytes, while encode performs actual character transformation. This distinction affects how Ruby interprets byte sequences.

bytes = "\xC3\xBC".b  # UTF-8 bytes for 'ü', tagged as binary by String#b
bytes.encoding
# => #<Encoding:ASCII-8BIT>

# Relabel the same bytes as UTF-8 (force_encoding mutates and returns the receiver)
utf8_string = bytes.force_encoding('UTF-8')
utf8_string
# => "ü"

# Convert to different encoding
latin1_string = utf8_string.encode('ISO-8859-1')
latin1_string.bytes
# => [252]
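
To make the distinction concrete, the following sketch applies both operations to the same one-character string and compares bytes and lengths. It assumes a UTF-8 source file (Ruby's default); the variable names are illustrative.

```ruby
original = "ü"  # UTF-8, bytes C3 BC

# Relabeling: same bytes, now read as two Latin-1 characters.
relabeled = original.dup.force_encoding('ISO-8859-1')

# Converting: same character, re-expressed as the single Latin-1 byte FC.
converted = original.encode('ISO-8859-1')

puts relabeled.bytes.inspect  # prints: [195, 188]
puts relabeled.length         # prints: 2
puts converted.bytes.inspect  # prints: [252]
puts converted.length         # prints: 1
```

force_encoding is the right tool only when the label is wrong and the bytes are right; encode is the right tool when the label is correct and a different byte representation is needed.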

Conversion options control how Ruby handles characters that don't exist in the target encoding. The undef parameter specifies behavior for undefined characters, while replace sets the replacement string.

japanese = "こんにちは"

# Default: raise exception
japanese.encode('ASCII') rescue 'error'
# => "error"

# Replace undefined characters
japanese.encode('ASCII', undef: :replace, replace: '[?]')
# => "[?][?][?][?][?]"

# Skip undefined characters
japanese.encode('ASCII', undef: :replace, replace: '')
# => ""
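
The xml option listed in the Reference section follows the same pattern as undef and replace. As a minimal sketch, assuming text destined for an ASCII-only XML document, xml: :text escapes the markup characters and turns characters the target cannot represent into hexadecimal character references instead of raising.

```ruby
snippet = "Café <b> & more"

# Markup characters are escaped; "é" has no US-ASCII form, so it becomes
# a numeric character reference.
escaped = snippet.encode('US-ASCII', xml: :text)

puts escaped           # prints: Caf&#xE9; &lt;b&gt; &amp; more
puts escaped.encoding  # prints: US-ASCII
```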

Error Handling & Debugging

Encoding conversions raise specific exceptions when characters cannot be represented in the target encoding or when byte sequences are invalid. Ruby provides several exception classes for different encoding problems.

Encoding::UndefinedConversionError is raised when a source character does not exist in the target encoding. The exception exposes the offending character along with the source and destination encodings.

begin
  "Naïve".encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  puts "Character: #{e.error_char.inspect}"
  puts "Source encoding: #{e.source_encoding}"
  puts "Destination encoding: #{e.destination_encoding}"
end
# Character: "ï"
# Source encoding: UTF-8  
# Destination encoding: US-ASCII

Encoding::InvalidByteSequenceError is raised when the source string contains byte sequences that are invalid for its declared encoding. This commonly happens with binary data incorrectly labeled as text.

binary_data = "\xFF\xFE\x00\x41"
binary_data.force_encoding('UTF-8')

begin
  binary_data.encode('ASCII')
rescue Encoding::InvalidByteSequenceError => e
  puts "Source encoding: #{e.source_encoding_name}"
  puts "Error bytes: #{e.error_bytes.unpack1('H*')}"
end
# Source encoding: UTF-8
# Error bytes: ff
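
When raising is not desirable, invalid bytes can be substituted instead. The sketch below shows two routes under the assumption that a "?" placeholder is acceptable: the invalid: :replace option during transcoding, and String#scrub, which repairs a string without changing its encoding.

```ruby
# A UTF-8 string with one stray byte (\xFF never appears in valid UTF-8).
damaged = "abc\xFFdef"

# Transcoding with invalid: :replace substitutes the bad byte on the way out.
ascii = damaged.encode('US-ASCII', invalid: :replace, replace: '?')

# String#scrub returns a repaired copy in the same encoding.
repaired = damaged.scrub('?')

puts ascii                     # prints: abc?def
puts repaired                  # prints: abc?def
puts repaired.valid_encoding?  # prints: true
```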

The valid_encoding? method checks if a string's bytes match its declared encoding without raising exceptions. This allows pre-validation before conversion attempts.

def safe_encode(text, target_encoding)
  unless text.valid_encoding?
    return nil, "Invalid byte sequence for #{text.encoding}"
  end

  begin
    converted = text.encode(target_encoding)
    [converted, nil]
  rescue Encoding::UndefinedConversionError => e
    [nil, "Cannot convert '#{e.error_char}' to #{e.destination_encoding_name}"]
  rescue Encoding::InvalidByteSequenceError => e
    [nil, "Invalid byte sequence: #{e.error_bytes.unpack1('H*')}"]
  end
end

result, error = safe_encode("Café", 'ASCII')
# result => nil
# error  => "Cannot convert 'é' to US-ASCII"

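For finer control than a single replacement string, the fallback option supplies a per-character substitute. It accepts a Hash, a Proc, a Method, or any object that responds to [], and is consulted for each character that is undefined in the target encoding. A character without a mapping still raises Encoding::UndefinedConversionError.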

# Custom character replacements
replacements = {
  'é' => 'e',
  'ñ' => 'n',
  '©' => '(c)'
}

"Café ñoño ©2023".encode('ASCII', fallback: replacements)
# => "Cafe nono (c)2023"
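
A Proc fallback is handy when the set of problem characters is not known in advance. The sketch below uses a hypothetical lambda, codepoint_tag, that renders every unmappable character as a visible code point tag rather than dropping it.

```ruby
# Hypothetical fallback: render each unmappable character as a code point tag.
codepoint_tag = ->(char) { format('<U+%04X>', char.ord) }

tagged = "naïve ☃".encode('US-ASCII', fallback: codepoint_tag)

puts tagged  # prints: na<U+00EF>ve <U+2603>
```

Unlike a fixed replacement string, this keeps enough information to reconstruct the original characters later.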

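Debugging encoding issues requires examining byte sequences and encoding labels separately. String#inspect escapes bytes that are invalid in the string's encoding, so printing it alongside String#encoding and the raw bytes shows where the label and the content disagree.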

def debug_encoding(str)
  puts "String: #{str.inspect}"
  puts "Encoding: #{str.encoding}"
  puts "Bytes: #{str.bytes.map { |b| sprintf('%02X', b) }.join(' ')}"
  puts "Valid: #{str.valid_encoding?}"
  puts "ASCII only: #{str.ascii_only?}"
end

debug_encoding("test\xFF".force_encoding('UTF-8'))
# String: "test\xFF"
# Encoding: UTF-8
# Bytes: 74 65 73 74 FF
# Valid: false
# ASCII only: false
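
When the correct label is unknown, one debugging step is to test which candidate encodings the bytes are well-formed in. The helper below, plausible_encodings, is a hypothetical sketch and not a real detector: single-byte encodings such as ISO-8859-1 accept every byte sequence, so passing the check is only weak evidence.

```ruby
# Hypothetical helper: which candidate labels make these bytes well-formed?
def plausible_encodings(bytes, candidates = %w[US-ASCII UTF-8 ISO-8859-1])
  candidates.select do |name|
    # dup so the caller's string keeps its original label
    bytes.dup.force_encoding(name).valid_encoding?
  end
end

puts plausible_encodings("caf\xC3\xA9".b).inspect  # prints: ["UTF-8", "ISO-8859-1"]
puts plausible_encodings("caf\xE9".b).inspect      # prints: ["ISO-8859-1"]
```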

Performance & Memory

Encoding conversions allocate a new string and run every character through a transcoding step, so their cost depends on the encoding pair and on the text itself. ASCII-only text bound for an ASCII-compatible encoding is usually the cheapest case, while conversions between multi-byte formats do more work per character.

Multi-byte conversions have to find character boundaries and look characters up in mapping tables, and some encoding pairs are converted through an intermediate encoding. Treat these as tendencies and measure with representative data.

require 'benchmark'

ascii_text = "Hello World" * 1000
utf8_text = "Hello 世界" * 1000
complex_text = File.read('large_file.txt', encoding: 'UTF-8')

Benchmark.bmbm do |x|
  x.report("ASCII to Latin1") { ascii_text.encode('ISO-8859-1') }
  x.report("UTF-8 to UTF-16") { utf8_text.encode('UTF-16') }  
  x.report("UTF-8 to ASCII") { complex_text.encode('ASCII', undef: :replace) }
end

Memory usage differs between encodings. Converting ASCII-only text to UTF-16 roughly doubles its size, since every character grows from one byte to two. Converting UTF-8 to ASCII with replacement can shrink the string, because each multi-byte character collapses to a single-byte replacement.

original = "Mixed text with émojis 🎉" * 10000

# Memory efficient: same character count, different byte representation
ascii_version = original.encode('ASCII', undef: :replace, replace: '?')

puts "Original bytes: #{original.bytesize}"
puts "ASCII bytes: #{ascii_version.bytesize}"  
puts "Character count: #{original.length} vs #{ascii_version.length}"
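
The size differences are easy to check directly with bytesize. The sketch below uses a hypothetical helper, byte_sizes, and the BOM-free little-endian variants so that the counts reflect the characters alone.

```ruby
# Hypothetical helper: byte size of the same text in several encodings.
def byte_sizes(text, encodings = %w[UTF-8 UTF-16LE UTF-32LE])
  encodings.to_h { |name| [name, text.encode(name).bytesize] }
end

byte_sizes("Hello").each { |name, size| puts "#{name}: #{size}" }
# prints: UTF-8: 5, UTF-16LE: 10, UTF-32LE: 20 (one per line)

byte_sizes("世界").each { |name, size| puts "#{name}: #{size}" }
# prints: UTF-8: 6, UTF-16LE: 4, UTF-32LE: 8 (one per line)
```

Note that UTF-16 is larger for ASCII text but smaller for the CJK sample, so the cheapest encoding depends on the content.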

Large file processing benefits from streaming conversions rather than loading entire files into memory. Ruby's IO methods accept encoding parameters for automatic conversion during read operations.

# Memory efficient file processing
def convert_large_file(input_path, output_path, target_encoding)
  File.open(output_path, 'w', encoding: target_encoding) do |output|
    File.open(input_path, 'r', encoding: 'UTF-8') do |input|
      input.each_line do |line|
        begin
          converted_line = line.encode(target_encoding, undef: :replace)
          output.write(converted_line)
        rescue Encoding::InvalidByteSequenceError
          # Skip invalid lines or handle specifically
          next
        end
      end
    end
  end
end
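
An alternative to calling encode on every line is to let the IO layer transcode while reading, using an "external:internal" encoding pair in the open mode. The following is a minimal sketch; each_line_as_utf8 is a hypothetical helper, and the temporary file exists only to make the example self-contained. Invalid or unmappable input still raises during the read unless conversion options are supplied.

```ruby
require 'tempfile'

# Hypothetical helper: yield each line of a file as UTF-8, transcoding as it reads.
def each_line_as_utf8(path, source_encoding)
  File.open(path, "r:#{source_encoding}:UTF-8") do |file|
    file.each_line { |line| yield line }
  end
end

# Demo with a small Latin-1 file written as raw bytes.
lines = Tempfile.create('latin1-demo') do |tmp|
  tmp.binmode
  tmp.write("caf\xE9\n".b)
  tmp.flush

  collected = []
  each_line_as_utf8(tmp.path, 'ISO-8859-1') { |line| collected << line }
  collected
end

puts lines.first           # prints: café
puts lines.first.encoding  # prints: UTF-8
```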

Conversion performance varies with character distribution. Text containing only ASCII characters converts quickly to any ASCII-compatible encoding, while text with many Unicode characters requires more processing time.

require 'benchmark'

def performance_comparison(text, iterations = 10000)
  encodings = ['ASCII', 'ISO-8859-1', 'UTF-16', 'Shift_JIS']
  
  encodings.each do |target|
    time = Benchmark.realtime do
      iterations.times do
        text.encode(target, undef: :replace) rescue nil
      end
    end
    
    puts "#{target}: #{(time * 1000).round(2)}ms for #{iterations} conversions"
  end
end

performance_comparison("Hello World")  # ASCII only
performance_comparison("こんにちは")    # Multi-byte characters

Common Pitfalls

The force_encoding method changes encoding labels without converting bytes, leading to corrupted text when used incorrectly. Developers often confuse this with actual encoding conversion.

# WRONG: This corrupts the text
utf8_bytes = "Café".bytes  # [67, 97, 102, 195, 169]
binary_string = utf8_bytes.pack('C*').force_encoding('ASCII')
binary_string
# => "Caf\xC3\xA9"

# CORRECT: Use encode for conversion
binary_string.force_encoding('UTF-8').encode('ASCII', undef: :replace)
# => "Caf?"

Binary data mixed with text causes encoding errors when Ruby interprets non-text bytes as characters. This commonly occurs when reading files that contain both text and binary sections.

# File contains text followed by binary data
mixed_content = "Header: Info\n\x00\x01\x02\xFF\xFE"

# This will fail
mixed_content.force_encoding('UTF-8').encode('ASCII') rescue 'encoding error'
# => "encoding error"

# Solution: separate text and binary portions
text_portion = mixed_content.split("\x00").first
text_portion.encode('ASCII', undef: :replace)
# => "Header: Info\n"

Default encoding assumptions cause problems when system default encodings differ from file encodings. Ruby uses system locale settings that may not match file content.

# System default might be ASCII
Encoding.default_external
# => #<Encoding:US-ASCII>

# But file contains UTF-8 
# This creates encoding mismatches
file_content = File.read('utf8_file.txt')  # Assumes system default
file_content.valid_encoding?
# => false

# Always specify encoding explicitly  
file_content = File.read('utf8_file.txt', encoding: 'UTF-8')
file_content.valid_encoding?
# => true
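
Where a wrong default would otherwise go unnoticed, a stricter pattern is to read raw bytes, label them, and validate at once. The helper below, read_utf8_strict, is a hypothetical sketch of that idea; the temporary file only keeps the example self-contained.

```ruby
require 'tempfile'

# Hypothetical helper: read raw bytes, label them UTF-8, and fail loudly if they are not.
def read_utf8_strict(path)
  text = File.binread(path).force_encoding(Encoding::UTF_8)
  raise ArgumentError, "#{path} is not valid UTF-8" unless text.valid_encoding?

  text
end

content = Tempfile.create('utf8-demo') do |tmp|
  tmp.binmode
  tmp.write("na\u00EFve\n")
  tmp.flush
  read_utf8_strict(tmp.path)
end

puts content.encoding         # prints: UTF-8
puts content.valid_encoding?  # prints: true
```

Because the file is read in binary mode, the result does not depend on Encoding.default_external.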

Encoding conversion loses information when target encodings cannot represent source characters. This data loss is permanent and cannot be reversed without the original text.

original = "Résumé with naïve café"
ascii_version = original.encode('ASCII', undef: :replace, replace: '?')
# => "R?sum? with na?ve caf?"

# Cannot recover original characters
ascii_version.encode('UTF-8')
# => "R?sum? with na?ve caf?"  # Still has question marks

String concatenation between different encodings follows complex compatibility rules that can produce unexpected results or raise exceptions.

ascii_str = "Hello".encode('ASCII')  
utf8_str = " 世界".encode('UTF-8')

# This works: ASCII is UTF-8 compatible
result = ascii_str + utf8_str
result.encoding
# => #<Encoding:UTF-8>

# This fails: incompatible encodings
latin1_str = "café".encode('ISO-8859-1')
utf16_str = "world".encode('UTF-16')

latin1_str + utf16_str rescue 'incompatible encodings'
# => "incompatible encodings"
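
One way to avoid the exception is to check compatibility first and transcode only when needed. The helper below, concat_in_common_encoding, is a hypothetical sketch that assumes UTF-8 is an acceptable common encoding for both operands.

```ruby
# Hypothetical helper: concatenate directly when Ruby allows it, otherwise
# transcode both sides to a shared encoding first.
def concat_in_common_encoding(left, right, common = Encoding::UTF_8)
  return left + right if Encoding.compatible?(left, right)

  left.encode(common) + right.encode(common)
end

latin1 = "café".encode('ISO-8859-1')
utf8   = " 世界"

joined = concat_in_common_encoding(latin1, utf8)
puts joined           # prints: café 世界
puts joined.encoding  # prints: UTF-8
```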

Regular expressions operate on encoding-aware character boundaries, not byte boundaries. A pattern that contains non-ASCII characters is fixed to its own encoding, and matching it against a non-ASCII string in a different encoding raises Encoding::CompatibilityError rather than silently failing.

# Pattern is UTF-8, but string is Latin1
pattern = /café/
latin1_string = "café".encode('ISO-8859-1')

# This raises Encoding::CompatibilityError
pattern.match(latin1_string) rescue 'incompatible encodings'
# => "incompatible encodings"

# Convert to matching encoding first
pattern.match(latin1_string.encode('UTF-8'))
# => #<MatchData "café">
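
The two ways out of a pattern/string mismatch can be sketched side by side: transcode the string to the pattern's encoding, or build the pattern from a string that is already in the subject's encoding. The variable names are illustrative.

```ruby
latin1_string = "café".encode('ISO-8859-1')

# A UTF-8 pattern with non-ASCII characters cannot be applied to this string.
error_class =
  begin
    /café/.match?(latin1_string)
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end
puts error_class.inspect  # prints: Encoding::CompatibilityError

# Option 1: bring the string to the pattern's encoding.
puts(/café/.match?(latin1_string.encode('UTF-8')))  # prints: true

# Option 2: build the pattern from a string in the subject's encoding.
latin1_pattern = Regexp.new("café".encode('ISO-8859-1'))
puts latin1_pattern.match?(latin1_string)           # prints: true
```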

Reference

Core Classes and Methods

| Class/Method | Purpose | Returns |
| --- | --- | --- |
| Encoding | Represents character encoding | Encoding object |
| String#encoding | Get string's encoding | Encoding object |
| String#encode(encoding, **opts) | Convert to target encoding | New string |
| String#force_encoding(encoding) | Change encoding label | Self (modified) |
| String#valid_encoding? | Check encoding validity | Boolean |
| String#ascii_only? | Check if ASCII characters only | Boolean |

Encoding Detection and Conversion Options

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Encoding.find(name) | name (String) | Encoding | Get encoding by name or alias |
| Encoding.list | None | Array<Encoding> | All available encodings |
| Encoding.compatible?(obj1, obj2) | Two objects | Encoding or nil | Common compatible encoding |
| String#encode(to, **opts) | to (String or Encoding), options (Hash) | String | Convert encoding with options |

Conversion Options

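| Option | Values | Default | Description |
| --- | --- | --- | --- |
| undef | nil, :replace | nil (raise) | Undefined character handling |
| invalid | nil, :replace | nil (raise) | Invalid byte sequence handling |
| replace | String | "\uFFFD" for Unicode targets, "?" otherwise | Replacement character/string |
| fallback | Hash, Proc, Method, or object responding to [] | nil | Custom character mappings |
| xml | :text, :attr | nil | XML escaping; undefined characters become hexadecimal character references |
| cr_newline | Boolean | false | Convert LF to CR |
| crlf_newline | Boolean | false | Convert LF to CRLF |
| universal_newline | Boolean | false | Convert CRLF and CR to LF |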
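
The newline options can be used without a target encoding, in which case only the line endings change. A minimal sketch, assuming Encoding.default_internal is unset (the default) so that the strings keep their encoding: universal_newline: true normalizes CRLF and CR to LF, and crlf_newline: true expands LF to CRLF.

```ruby
mixed = "one\r\ntwo\rthree\n"

unix    = mixed.encode(universal_newline: true)
windows = "one\ntwo\n".encode(crlf_newline: true)

puts unix.inspect     # prints: "one\ntwo\nthree\n"
puts windows.inspect  # prints: "one\r\ntwo\r\n"
```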

Exception Hierarchy

EncodingError
├── Encoding::UndefinedConversionError
├── Encoding::InvalidByteSequenceError  
├── Encoding::ConverterNotFoundError
└── Encoding::CompatibilityError

Common Encodings

| Encoding Name | Aliases | Character Set | Use Case |
| --- | --- | --- | --- |
| UTF-8 | CP65001 | Unicode | Web, modern text |
| US-ASCII | ASCII | 7-bit ASCII | Legacy systems |
| ISO-8859-1 | ISO8859-1 | Latin alphabet (Latin-1) | Western European |
| Windows-1252 | CP1252 | Extended Latin | Windows systems |
| UTF-16 | None | Unicode | Windows APIs |
| Shift_JIS | None (SJIS resolves to Windows-31J) | Japanese | Japanese systems |
| EUC-JP | eucJP | Japanese | Unix Japanese |
| ASCII-8BIT | BINARY | Raw bytes | Binary data |

String Method Quick Reference

# Encoding information
str.encoding              # Current encoding
str.valid_encoding?       # Validity check  
str.ascii_only?          # ASCII character check

# Encoding conversion
str.encode(target)                    # Basic conversion
str.encode(target, undef: :replace)   # Replace undefined chars
str.encode(target, invalid: :replace) # Replace invalid bytes  
str.force_encoding(encoding)          # Change label only

# Byte operations
str.bytes                 # Array of byte values
str.bytesize             # Size in bytes
str.length               # Size in characters

Error Handling Patterns

# Exception handling
begin
  converted = str.encode(target)
rescue Encoding::UndefinedConversionError
  # Handle undefined characters by substituting the replacement string
  converted = str.encode(target, undef: :replace)
rescue Encoding::InvalidByteSequenceError
  # Handle invalid byte sequences
  str = str.scrub  # Replace invalid sequences with the replacement character
  retry
end

# Validation before conversion
if str.valid_encoding?
  converted = str.encode(target, undef: :replace)
else
  cleaned = str.scrub
  converted = cleaned.encode(target, undef: :replace)  
end

File I/O with Encodings

# Read with specific encoding
File.read('file.txt', encoding: 'UTF-8')

# Write with encoding conversion
File.write('output.txt', content, encoding: 'ISO-8859-1')

# Open with encoding specification
File.open('file.txt', 'r:UTF-8:ASCII') do |f|
  # Reads UTF-8, converts to ASCII
end