CrackedRuby

Encoding Classes

Ruby's encoding classes provide comprehensive text encoding management, conversion, and validation for handling character data across different encoding systems.

Overview

Ruby's encoding system centers around the Encoding class and encoding-aware string operations. The Encoding class represents character encoding schemes like UTF-8, ASCII, and ISO-8859-1, while strings maintain encoding metadata that determines how byte sequences are interpreted as characters.

The system provides three primary capabilities: encoding identification, character conversion, and encoding validation. Every string in Ruby has an associated encoding that affects string operations, regular expression matching, and I/O operations.

# Encoding objects represent encoding schemes
utf8 = Encoding::UTF_8
ascii = Encoding::ASCII

# Strings carry encoding metadata
str = "Hello"
str.encoding  # => #<Encoding:UTF-8>

# Encoding affects character interpretation
bytes = "\xC3\xA9"  # UTF-8 bytes for "é"
bytes.force_encoding("UTF-8")   # => "é"
bytes.force_encoding("ASCII")   # => "\xC3\xA9"

Ruby maintains a default external encoding and a default internal encoding for I/O operations. The external encoding determines how Ruby labels bytes read from files and network connections. The internal encoding, when set, is the encoding that data is transcoded to as it is read and transcoded from as it is written; when it is nil, which is the default, no automatic transcoding takes place.

# Default encodings affect I/O behavior
Encoding.default_external  # => #<Encoding:UTF-8>
Encoding.default_internal  # => nil

# File operations use default external encoding
File.write("test.txt", "content")
File.read("test.txt").encoding  # => #<Encoding:UTF-8>
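The difference between labeling and transcoding on read can be seen with a per-call external and internal encoding. The following is a minimal sketch, assuming a temporary file that holds ISO-8859-1 bytes; the file name prefix and variable names are illustrative only.

```ruby
require "tempfile"

# Write Latin-1 bytes to a temporary file ("é" is the single byte 0xE9).
file = Tempfile.new("latin1")
file.binmode
file.write("Caf\xE9".b)
file.close

# "r:ISO-8859-1:UTF-8" means external encoding ISO-8859-1, internal UTF-8,
# so Ruby transcodes the bytes to UTF-8 while reading.
transcoded = File.read(file.path, mode: "r:ISO-8859-1:UTF-8")

# With only an external encoding, the bytes are labeled but not converted.
labeled = File.read(file.path, encoding: "ISO-8859-1")

puts transcoded.encoding  # UTF-8
puts labeled.encoding     # ISO-8859-1
file.unlink
```

Passing the encodings per call avoids changing Encoding.default_internal globally, which would affect every read in the process.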

The encoding system integrates with Ruby's string methods, regular expressions, and I/O operations. String concatenation checks that the operands' encodings are compatible, pattern matching operates on characters rather than raw bytes, and file operations label (and optionally transcode) data using the configured encodings.
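
As a small illustration of how the encoding shapes string methods, the sketch below compares character-oriented and byte-oriented views of the same text. The sample string is arbitrary.

```ruby
text = "naïve café"

# Character-oriented methods count characters, byte-oriented ones count bytes.
puts text.length    # 10 characters
puts text.bytesize  # 12 bytes ("ï" and "é" take two bytes each in UTF-8)

# The same bytes viewed as binary data: every byte counts as one character.
binary = text.b
puts binary.length  # 12

# Regular expressions match whole characters in the string's encoding.
puts text[/caf./]   # café
```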

Basic Usage

Encoding objects are accessed through constants in the Encoding class or looked up by name with Encoding.find. Each encoding object contains metadata about the character encoding scheme, including its name, aliases, and properties.

# Access encodings by constant
utf8 = Encoding::UTF_8
latin1 = Encoding::ISO_8859_1
ascii = Encoding::US_ASCII

# Access encodings by name
utf8 = Encoding.find("UTF-8")
shift_jis = Encoding.find("Shift_JIS")

# List available encodings
Encoding.list.size          # => 100+
Encoding.name_list.first(5) # => ["ASCII-8BIT", "UTF-8", "US-ASCII", ...]
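
Name lookup is case-insensitive and resolves aliases to the same Encoding object, while unknown names raise ArgumentError. The sketch below shows this; `find_encoding_or_nil` is a hypothetical helper introduced only for illustration.

```ruby
# Names are matched case-insensitively, and aliases resolve to the same object.
puts Encoding.find("utf-8") == Encoding::UTF_8        # true
puts Encoding.find("ASCII") == Encoding::US_ASCII     # true
puts Encoding.find("binary") == Encoding::ASCII_8BIT  # true

# Every encoding lists its canonical name among its names.
puts Encoding::UTF_8.names.include?("UTF-8")          # true

# Hypothetical helper: return nil instead of raising for unknown names.
def find_encoding_or_nil(name)
  Encoding.find(name)
rescue ArgumentError
  nil
end

p find_encoding_or_nil("no-such-encoding")  # nil
```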

String encoding is controlled through several methods that either change encoding metadata or perform actual character conversion. The force_encoding method changes encoding interpretation without modifying bytes, while encode performs character-by-character conversion.

# Change encoding interpretation (no byte modification)
data = "\x48\x65\x6C\x6C\x6F"  # ASCII bytes for "Hello"
data.encoding                   # => #<Encoding:UTF-8>
ascii_str = data.force_encoding("ASCII")
ascii_str.encoding             # => #<Encoding:US-ASCII>

# Convert between encodings (modifies bytes)
utf8_str = "Héllo"             # UTF-8 string
latin1_str = utf8_str.encode("ISO-8859-1")
latin1_str.bytes               # => [72, 233, 108, 108, 111]
utf8_str.bytes                 # => [72, 195, 169, 108, 108, 111]
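
The distinction between relabeling and converting is also what makes a common form of garbled text ("mojibake") reversible. The following sketch first reproduces the problem and then undoes it; it assumes the text was mislabeled as ISO-8859-1 exactly once.

```ruby
original = "é"  # UTF-8 bytes C3 A9

# Simulate the bug: the UTF-8 bytes are mislabeled as ISO-8859-1 and then
# converted to UTF-8, which turns each byte into its own character.
mojibake = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts mojibake  # Ã©

# Undo it: convert back to the single-byte encoding to recover the original
# bytes, then relabel those bytes as UTF-8.
repaired = mojibake.encode("ISO-8859-1").force_encoding("UTF-8")
puts repaired == original  # true
```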

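Encoding validation ensures strings contain valid byte sequences for their declared encoding. The valid_encoding? method checks validity without raising exceptions. Operations that must interpret the bytes can fail on invalid data: transcoding with encode raises Encoding::InvalidByteSequenceError, and methods such as regular expression matching typically raise ArgumentError.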

# Validate encoding correctness
valid_utf8 = "Hello 世界"
valid_utf8.valid_encoding?     # => true

# Invalid byte sequences
invalid = "\xFF\xFE".force_encoding("UTF-8")
invalid.valid_encoding?        # => false

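# Check ASCII compatibility of encodings
# (call ascii_compatible? on the Encoding object; a literal such as
# "UTF-16" is itself just a UTF-8 string)
Encoding::US_ASCII.ascii_compatible?  # => true
Encoding::UTF_8.ascii_compatible?     # => true
Encoding::UTF_16.ascii_compatible?    # => false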

String operations between different encodings follow specific compatibility rules. Strings in different ASCII-compatible encodings can be combined when at least one of them contains only ASCII characters. In other cases the operation requires explicit conversion or raises Encoding::CompatibilityError.

# Compatible encoding operations
ascii_str = "Hello".encode("US-ASCII") # US-ASCII (plain literals are UTF-8)
utf8_str = " 世界"                     # UTF-8
combined = ascii_str + utf8_str        # => "Hello 世界" (UTF-8)

# Incompatible encodings
utf16_str = "Hello".encode("UTF-16")
utf8_str = "World"
# utf16_str + utf8_str  # => Encoding::CompatibilityError

# Explicit conversion resolves incompatibility
combined = utf16_str.encode("UTF-8") + utf8_str
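
Compatibility can also be checked up front with Encoding.compatible?, which returns the resulting encoding or nil. The sketch below uses it in a hypothetical `safe_concat` helper that falls back to converting both operands to UTF-8; that fallback assumes both strings are validly encoded text.

```ruby
ascii  = "Hello".encode("US-ASCII")
utf8   = " 世界"
latin1 = "é".encode("ISO-8859-1")

# ASCII-only content combines with a UTF-8 string; the result is UTF-8.
p Encoding.compatible?(ascii, utf8)   # #<Encoding:UTF-8>

# Two strings with non-ASCII content in different encodings do not combine.
p Encoding.compatible?(latin1, utf8)  # nil

# Hypothetical helper: concatenate directly when possible, otherwise
# transcode both operands to UTF-8 first.
def safe_concat(left, right)
  return left + right if Encoding.compatible?(left, right)

  left.encode("UTF-8") + right.encode("UTF-8")
end

puts safe_concat(latin1, utf8)  # é 世界
```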

Error Handling & Debugging

Encoding operations generate specific exception types that indicate different error conditions. Encoding::InvalidByteSequenceError occurs when the string being converted contains bytes that are invalid in its source encoding, while Encoding::UndefinedConversionError indicates a valid character that cannot be represented in the target encoding.

# Handle invalid byte sequences
begin
  invalid_bytes = "\xFF\xFE"
  invalid_bytes.force_encoding("UTF-8")
  invalid_bytes.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.inspect}"
  puts "Source encoding: #{e.source_encoding}"
  puts "Destination encoding: #{e.destination_encoding}"
end

# Handle undefined character conversions
begin
  unicode_str = "Hello 🚀 World"
  unicode_str.encode("ASCII")
rescue Encoding::UndefinedConversionError => e
  puts "Undefined character: #{e.error_char.inspect}"
  puts "Cannot convert to: #{e.destination_encoding}"
end
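
Both conversion errors descend from EncodingError, so a single method can report them separately and still catch any other encoding failure. A minimal sketch follows; `try_encode` is a hypothetical helper that returns a pair of converted text and error message.

```ruby
# Hypothetical helper: returns [converted_string, nil] on success
# or [nil, message] when the conversion fails.
def try_encode(text, target)
  [text.encode(target), nil]
rescue Encoding::InvalidByteSequenceError => e
  [nil, "invalid bytes #{e.error_bytes.inspect} in #{e.source_encoding_name}"]
rescue Encoding::UndefinedConversionError => e
  [nil, "#{e.error_char.inspect} has no #{e.destination_encoding_name} form"]
rescue EncodingError => e
  # Any other encoding failure, such as a missing converter
  [nil, e.message]
end

p try_encode("Hello", "US-ASCII")
p try_encode("Hello 🚀", "US-ASCII")
p try_encode("\xFF\xFE".dup.force_encoding("UTF-8"), "UTF-16")
```

Rescuing the specific classes first keeps the detailed accessors (error_bytes, error_char) available, while the final EncodingError clause acts as a catch-all.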

Encoding conversion supports fallback strategies for handling conversion errors. The :invalid and :undef options control error handling behavior, while the :replace option specifies the replacement string.

# Replace invalid sequences
invalid_utf8 = "Hello \xFF World".force_encoding("UTF-8")
cleaned = invalid_utf8.encode("UTF-8", 
                              invalid: :replace, 
                              replace: "?")
# => "Hello ? World"

# Replace undefined characters  
unicode_text = "Café 🚀"
ascii_safe = unicode_text.encode("ASCII", 
                                 undef: :replace, 
                                 replace: "?")
# => "Caf? ?"

# Skip problematic characters
cleaned = unicode_text.encode("ASCII", 
                             undef: :replace, 
                             replace: "")
# => "Caf " (the space before the emoji is kept)
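
Beyond a single replacement string, the :fallback option lets each undefined character choose its own substitute. The sketch below shows a Hash and a lambda as fallbacks; the particular replacements are arbitrary examples.

```ruby
text = "Café 🚀"

# A Hash maps each unconvertible character to its replacement.
# A character missing from the Hash still raises UndefinedConversionError.
replacements = { "é" => "e", "🚀" => "[rocket]" }
with_hash = text.encode("US-ASCII", fallback: replacements)
puts with_hash  # Cafe [rocket]

# A lambda receives the character and returns the replacement,
# here an escape built from the code point.
with_proc = text.encode("US-ASCII",
                        fallback: ->(char) { format("\\u{%X}", char.ord) })
puts with_proc  # Caf\u{E9} \u{1F680}
```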

Debugging encoding issues requires examining byte-level data and encoding metadata. The bytes, codepoints, and chars methods provide different views of string data for diagnosis.

# Analyze problematic strings
problem_str = "\u00E9\u0301"  # "é" with combining accent
puts "String: #{problem_str.inspect}"
puts "Encoding: #{problem_str.encoding}"
puts "Bytes: #{problem_str.bytes}"
puts "Codepoints: #{problem_str.codepoints.map { |c| "U+#{c.to_s(16).upcase}" }}"
puts "Valid: #{problem_str.valid_encoding?}"

# Compare different encoding interpretations
data = "\xC3\xA9"  # UTF-8 bytes for "é"
puts "As UTF-8: #{data.force_encoding('UTF-8').inspect}"
puts "As Latin-1: #{data.force_encoding('ISO-8859-1').inspect}"
puts "As ASCII: #{data.force_encoding('ASCII').inspect}"
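
These views can be bundled into one diagnostic summary. The sketch below uses a hypothetical `describe_string` helper to compare two strings that render alike but differ at the byte level.

```ruby
# Hypothetical helper: summarize the encoding-related facts about a string.
def describe_string(str)
  {
    encoding: str.encoding.name,
    valid: str.valid_encoding?,
    ascii_only: str.ascii_only?,
    codepoints: str.valid_encoding? ? str.codepoints : nil,
    bytes: str.bytes.map { |byte| format("%02X", byte) }.join(" ")
  }
end

precomposed = "\u00E9"   # single code point U+00E9
decomposed  = "e\u0301"  # "e" followed by a combining acute accent

p describe_string(precomposed)
p describe_string(decomposed)
puts precomposed == decomposed  # false: same appearance, different bytes
```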

Conversion can be performed incrementally when processing large amounts of data. The Encoding::Converter class provides fine-grained control over the conversion process and error handling.

# Incremental conversion with a bounded output chunk size
converter = Encoding::Converter.new("UTF-8", "ASCII", 
                                   invalid: :replace, 
                                   undef: :replace,
                                   replace: "?")

# primitive_convert consumes the source buffer, so work on a copy
input = "Café 🚀 World".dup
output = String.new
status = nil

loop do
  # Arguments: source, destination, destination byte offset (nil = append),
  # and the maximum number of bytes to write in this call
  status = converter.primitive_convert(input, output, nil, 5)
  puts "Status: #{status}, Output so far: #{output.inspect}"
  break unless status == :destination_buffer_full
end

puts "Final output: #{output}"
puts "Conversion finished: #{status == :finished}"
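
For chunked input, Converter#convert treats each call as part of a longer stream, so a multibyte character split across two chunks is still converted correctly. The following sketch assumes the chunk boundary falls in the middle of the two-byte "é"; the chunk size is chosen only to force that case.

```ruby
converter = Encoding::Converter.new("UTF-8", "ISO-8859-1")
source = "Café au lait"

# Split after 4 bytes: "Caf" plus the first byte of the two-byte "é".
chunks = [
  source.byteslice(0, 4),
  source.byteslice(4, source.bytesize - 4)
]

output = String.new(encoding: "ISO-8859-1")
chunks.each do |chunk|
  # convert keeps the incomplete trailing byte until the next call
  output << converter.convert(chunk)
end
# finish flushes anything still pending and ends the conversion
output << converter.finish

puts output.bytes == source.encode("ISO-8859-1").bytes  # true
```

Unlike primitive_convert, convert raises on invalid or undefined input, so the rescue patterns shown earlier in this section still apply.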

Production Patterns

Web applications frequently encounter encoding issues when processing user input, file uploads, and API responses. Establishing consistent encoding handling patterns prevents data corruption and application errors.

# Sanitize user input encoding
def sanitize_user_input(input)
  return "" if input.nil?
  
  # Copy before relabeling: force_encoding modifies its receiver
  # and raises FrozenError on frozen strings
  sanitized = input.to_s.dup.force_encoding("UTF-8")
  
  unless sanitized.valid_encoding?
    # Drop invalid byte sequences
    sanitized = sanitized.encode("UTF-8", 
                                invalid: :replace, 
                                undef: :replace, 
                                replace: "")
  end
  
  sanitized
end

# Process form data
form_data = params[:message]  # Potentially mixed encoding
clean_message = sanitize_user_input(form_data)
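
The same cleanup can also be expressed with String#scrub, which replaces invalid byte sequences directly. The sketch below is one possible variant, not a drop-in replacement; `scrub_to_utf8` is a hypothetical helper name.

```ruby
# Hypothetical helper: return a cleaned UTF-8 copy and leave the input untouched.
def scrub_to_utf8(input, replacement = "")
  input.to_s.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

# Bytes as they might arrive from a socket: valid "café", then a stray 0xFF.
raw = "caf\xC3\xA9 \xFF ok".b

clean   = scrub_to_utf8(raw)       # invalid bytes dropped
visible = scrub_to_utf8(raw, "?")  # invalid bytes made visible

puts visible            # café ? ok
puts clean.valid_encoding?  # true
puts raw.encoding       # the original string is still binary
```

Using a visible replacement such as "?" during development makes data loss easier to notice than silently dropping bytes.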

File processing operations require explicit encoding handling to prevent interpretation errors. Different file sources may use various encodings, requiring detection and normalization strategies.

# Robust file reading with an encoding fallback
def read_file_with_encoding(filepath)
  # Read raw bytes first so that no encoding is assumed
  raw = File.binread(filepath)

  # Prefer UTF-8 when the bytes form valid UTF-8
  utf8 = raw.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?

  # Single-byte encodings such as ISO-8859-1 and Windows-1252 accept any
  # byte sequence, so valid_encoding? cannot tell them apart. Pick one
  # legacy encoding as the fallback and transcode it to UTF-8.
  raw.force_encoding("Windows-1252")
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
end

# CSV processing with encoding handling
require 'csv'

def process_csv_file(filepath)
  content = read_file_with_encoding(filepath)
  
  rows = []
  CSV.parse(content, headers: true) do |row|
    # Sanitize each field value and keep the cleaned row
    rows << row.to_h.transform_values { |value| sanitize_user_input(value) }
  end
  rows
end
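
Files exported from spreadsheet tools often start with a UTF-8 byte order mark, which otherwise ends up glued to the first header name. The sketch below shows the "BOM|UTF-8" read mode stripping it; the temporary file content is made up for the example.

```ruby
require "tempfile"

# Create a file that starts with the UTF-8 byte order mark (EF BB BF).
file = Tempfile.new("bom")
file.binmode
file.write("\xEF\xBB\xBFname,city\n".b)
file.close

# A plain UTF-8 read keeps the BOM as the first character (U+FEFF).
plain = File.read(file.path, encoding: "UTF-8")

# "BOM|UTF-8" removes a leading BOM if one is present.
stripped = File.read(file.path, mode: "r:BOM|UTF-8")

puts plain.start_with?("\uFEFF")  # true
puts stripped.inspect             # "name,city\n"
file.unlink
```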

API integration requires careful encoding handling for JSON responses and request data. Different APIs may return content in various encodings, requiring normalization before processing.

# HTTP client with encoding handling
require 'net/http'
require 'json'

def fetch_api_data(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)
  
  # The body is typically labeled as binary; copy it before relabeling
  body = response.body.to_s.dup
  
  # Check the Content-Type header for a charset (optionally quoted)
  content_type = response['Content-Type'].to_s
  charset = content_type[/charset=["']?([^;"'\s]+)/i, 1]
  
  encoding = begin
    # Default to UTF-8 for JSON APIs
    charset ? Encoding.find(charset) : Encoding::UTF_8
  rescue ArgumentError
    Encoding::UTF_8  # Unknown charset name
  end
  body.force_encoding(encoding)
  
  # Normalize to UTF-8, dropping bytes that cannot be converted
  body = body.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
  
  JSON.parse(body)
rescue JSON::ParserError => e
  # Rails.logger assumes this runs inside a Rails application
  Rails.logger.error "Could not parse API response: #{e.message}"
  nil
end
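
The charset handling can be isolated and exercised without a network call. The following sketch uses a hypothetical `encoding_from_content_type` helper; the header values are examples, and the UTF-8 default is an assumption that suits JSON APIs.

```ruby
# Hypothetical helper: map a Content-Type header to an Encoding object,
# falling back to a default for missing or unknown charsets.
def encoding_from_content_type(header, default: Encoding::UTF_8)
  name = header.to_s[/charset\s*=\s*["']?([^;"'\s]+)/i, 1]
  return default unless name

  Encoding.find(name)
rescue ArgumentError
  default
end

p encoding_from_content_type("text/html; charset=ISO-8859-1")
p encoding_from_content_type('application/json; charset="utf-8"')
p encoding_from_content_type("text/plain; charset=klingon")  # falls back
p encoding_from_content_type(nil)                            # falls back

# Label a binary body with the detected encoding, then normalize to UTF-8.
body = "Caf\xE9".b
detected = encoding_from_content_type("text/plain; charset=ISO-8859-1")
text = body.force_encoding(detected).encode("UTF-8")
puts text  # Café
```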

Database operations benefit from explicit encoding configuration to ensure consistent data storage and retrieval. Many current database setups use UTF-8, but legacy systems may use different encodings.

# Database encoding configuration
class ApplicationRecord < ActiveRecord::Base
  # Mark the base class as abstract so it is not mapped to a table
  self.abstract_class = true

  # Ensure UTF-8 encoding for text fields
  before_save :normalize_text_encodings
  
  private
  
  def normalize_text_encodings
    self.class.columns.each do |column|
      if column.type == :text || column.type == :string
        value = read_attribute(column.name)
        if value.is_a?(String)
          normalized = sanitize_user_input(value)
          write_attribute(column.name, normalized)
        end
      end
    end
  end
end

# Migration for encoding issues
class FixEncodingIssues < ActiveRecord::Migration[7.0]
  def up
    # Fix existing data with encoding problems
    User.find_each do |user|
      if user.name && !user.name.valid_encoding?
        user.update_column(:name, 
          user.name.encode("UTF-8", 
                          invalid: :replace, 
                          replace: ""))
      end
    end
  end
end

Reference

Core Classes and Methods

| Class/Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Encoding.list | none | Array<Encoding> | All available encoding objects |
| Encoding.find(name) | name (String) | Encoding | Encoding object by name |
| Encoding.default_external | none | Encoding | Default encoding for I/O operations |
| Encoding.default_internal | none | Encoding or nil | Default internal encoding |
| String#encoding | none | Encoding | String's current encoding |
| String#force_encoding(encoding) | encoding (String/Encoding) | String | Change encoding without conversion |
| String#encode(encoding, **opts) | encoding, options (Hash) | String | Convert to different encoding |
| String#encode!(encoding, **opts) | encoding, options (Hash) | String | In-place encoding conversion |
| String#valid_encoding? | none | Boolean | Check if bytes are valid for encoding |
| String#ascii_only? | none | Boolean | Check if string contains only ASCII characters |

String Encoding Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| String#b | none | String | Return binary (ASCII-8BIT) copy |
| String#bytes | none | Array<Integer> | Array of byte values |
| String#chars | none | Array<String> | Array of characters |
| String#codepoints | none | Array<Integer> | Array of Unicode codepoints |
| String#each_byte | block | Enumerator or String | Iterate over bytes |
| String#each_char | block | Enumerator or String | Iterate over characters |
| String#each_codepoint | block | Enumerator or String | Iterate over codepoints |

Encoding Objects

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Encoding#name | none | String | Canonical encoding name |
| Encoding#names | none | Array<String> | All names and aliases |
| Encoding#ascii_compatible? | none | Boolean | Whether encoding is ASCII-compatible |
| Encoding#dummy? | none | Boolean | Whether encoding is a dummy encoding |
| Encoding#inspect | none | String | Human-readable representation |

Encoding Converter

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Encoding::Converter.new(src, dst, **opts) | source, destination, options | Converter | Create new converter |
| Converter#convert(string) | string (String) | String | Convert a string, treating it as part of a stream |
| Converter#primitive_convert(src, dst, offset, size) | source buffer, destination buffer, destination byte offset (nil to append), destination byte size | Symbol | Incremental conversion |
| Converter#finish | none | String | Finish conversion and return remaining output |
| Converter#source_encoding | none | Encoding | Source encoding |
| Converter#destination_encoding | none | Encoding | Destination encoding |

Common Encodings

| Constant | Name | Description |
| --- | --- | --- |
| Encoding::ASCII_8BIT | "ASCII-8BIT" | Binary data, no character interpretation |
| Encoding::US_ASCII | "US-ASCII" | 7-bit ASCII character set |
| Encoding::UTF_8 | "UTF-8" | Unicode UTF-8 encoding |
| Encoding::UTF_16 | "UTF-16" | Unicode UTF-16 encoding |
| Encoding::UTF_32 | "UTF-32" | Unicode UTF-32 encoding |
| Encoding::ISO_8859_1 | "ISO-8859-1" | Latin-1 character set |
| Encoding::Windows_1252 | "Windows-1252" | Windows Latin character set |
| Encoding::Shift_JIS | "Shift_JIS" | Japanese character encoding |

Conversion Options

| Option | Values | Description |
| --- | --- | --- |
| :invalid | :replace | Replace invalid byte sequences |
| :undef | :replace | Replace undefined characters |
| :replace | String | Replacement character/string |
| :fallback | Hash/Proc/Method | Custom character fallbacks |
| :xml | :text, :attr | XML entity replacement mode |
| :cr_newline | Boolean | Convert LF to CR |
| :crlf_newline | Boolean | Convert LF to CRLF |
| :universal_newline | Boolean | Convert CRLF and CR to LF |

Exception Hierarchy

| Exception | Description |
| --- | --- |
| Encoding::CompatibilityError | Incompatible encodings in operation |
| Encoding::InvalidByteSequenceError | Bytes invalid in the source encoding during conversion |
| Encoding::UndefinedConversionError | Character undefined in target encoding |
| Encoding::ConverterNotFoundError | No converter available for encoding pair |

Error Information Methods

| Method | Available On | Returns | Description |
| --- | --- | --- | --- |
| #source_encoding | InvalidByteSequenceError, UndefinedConversionError | Encoding | Source encoding |
| #destination_encoding | InvalidByteSequenceError, UndefinedConversionError | Encoding | Target encoding |
| #error_bytes | InvalidByteSequenceError | String | Invalid byte sequence |
| #error_char | UndefinedConversionError | String | Undefined character |
| #readagain_bytes | InvalidByteSequenceError | String | Bytes that must be read again after the error |