String Encoding

Overview

String encoding in Ruby controls how text characters map to byte sequences in memory and storage. Ruby represents strings as sequences of bytes combined with an encoding object that defines how to interpret those bytes as characters. Every string carries encoding metadata that determines character boundaries, case conversion rules, and validity constraints.

The Encoding class provides the foundation for all encoding operations. Ruby includes over 100 built-in encodings, from ASCII and UTF-8 to legacy formats like Windows-1252 and ISO-8859-1. The String#encoding method returns the encoding object, while String#encode converts between encodings.

str = "Hello, 世界"
str.encoding               # => #<Encoding:UTF-8>
str.bytesize              # => 13 (bytes in memory)
str.size                  # => 9 (logical characters)
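
The Encoding class can also enumerate and look up the built-in encodings directly. A small sketch (exact counts and alias lists vary by Ruby version):

Encoding.list.size          # => 100+ available encodings
Encoding.find('BINARY')     # => #<Encoding:ASCII-8BIT> (alias lookup)
Encoding::UTF_8.names       # => ["UTF-8", "CP65001", ...]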

Ruby strings are either valid or broken with respect to their declared encoding. Valid strings contain only byte sequences that form complete characters in that encoding. Broken strings contain invalid byte sequences, including strings that merely end with a truncated multibyte character; Ruby can still perform many operations on them.

# Valid UTF-8
valid = "café".force_encoding('UTF-8')
valid.valid_encoding?     # => true

# Broken encoding - invalid UTF-8 bytes
broken = "\xFF\xFE".force_encoding('UTF-8') 
broken.valid_encoding?    # => false

# String operations still work
broken.upcase             # => "\xFF\xFE"
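
A string cut off mid-character is also reported as invalid. A sketch using a lone UTF-8 lead byte:

# Incomplete - ends with a partial multibyte character
incomplete = "caf\xC3".force_encoding('UTF-8')  # 0xC3 begins the 2-byte é
incomplete.valid_encoding?  # => false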

The default source encoding for Ruby files is UTF-8. Ruby automatically handles encoding for string literals, but external data requires explicit encoding management. File operations, network communication, and database interactions all involve encoding decisions that affect correctness and performance.
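
As an illustration of the source-encoding default, a magic comment on a file's first line overrides it, and __ENCODING__ reports the encoding in effect:

# encoding: ISO-8859-1
# The comment above must be the file's first line to take effect
__ENCODING__          # => #<Encoding:ISO-8859-1>
"literal".encoding    # => #<Encoding:ISO-8859-1>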

Basic Usage

The primary encoding operations involve checking current encoding, converting between encodings, and forcing encoding interpretation. The String#encoding method returns the encoding object, while String#force_encoding changes the encoding label without converting bytes.

str = "résumé"
str.encoding                    # => #<Encoding:UTF-8>
str.force_encoding('ASCII-8BIT') # Change label only
str.encoding                    # => #<Encoding:ASCII-8BIT>
str.valid_encoding?             # => false (invalid ASCII bytes)

The String#encode method converts byte sequences from one encoding to another. This differs from force_encoding because it actually transforms the underlying bytes to match the target encoding's representation.

utf8_str = "café"               # UTF-8 source
latin1_str = utf8_str.encode('ISO-8859-1')
latin1_str.encoding             # => #<Encoding:ISO-8859-1>
latin1_str.bytes                # => [99, 97, 102, 233] (different bytes)

# Converting back
utf8_again = latin1_str.encode('UTF-8')
utf8_again == utf8_str          # => true

The Encoding.default_external and Encoding.default_internal settings control automatic conversion during I/O operations. Ruby applies these defaults when reading files or network data without explicit encoding specification.

# Check current defaults
Encoding.default_external       # => #<Encoding:UTF-8>
Encoding.default_internal       # => nil

# File operations use these defaults
File.write('test.txt', 'café')
content = File.read('test.txt')
content.encoding                # => #<Encoding:UTF-8>
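
The defaults can also be overridden per stream with an encoding in the mode string. Here 'legacy.txt' stands in for a hypothetical Latin-1 file transcoded to UTF-8 on read:

File.open('legacy.txt', 'r:ISO-8859-1:UTF-8') do |f|
  f.read.encoding   # => #<Encoding:UTF-8> (bytes transcoded on input)
end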

String concatenation between different encodings follows specific precedence rules. ASCII-compatible encodings can combine with ASCII strings, but incompatible encodings raise Encoding::CompatibilityError.

utf8_str = "café"               # UTF-8
ascii_str = "shop"              # ASCII (compatible with UTF-8)
result = utf8_str + ascii_str   # Works fine
result.encoding                 # => #<Encoding:UTF-8>

# Incompatible encodings
utf16_str = "test".encode('UTF-16')
utf8_str + utf16_str            # => Encoding::CompatibilityError
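
Encoding.compatible? checks these rules ahead of time, returning the encoding the result would have, or nil when concatenation would raise:

Encoding.compatible?("café", "shop")                   # => #<Encoding:UTF-8>
Encoding.compatible?("café", "test".encode('UTF-16'))  # => nil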

Pattern matching with regular expressions requires encoding compatibility between the pattern and target string. Ruby automatically promotes ASCII patterns to match the string's encoding when possible.

utf8_text = "The café is open"
ascii_pattern = /caf/           # ASCII pattern
utf8_text =~ ascii_pattern      # => 4 (works due to ASCII compatibility)

# Unicode character classes work with proper encoding
utf8_text.scan(/\p{L}+/)        # => ["The", "café", "is", "open"]

Error Handling & Debugging

Encoding errors fall into several categories, each requiring different handling strategies. Encoding::UndefinedConversionError occurs when the target encoding cannot represent specific characters. Encoding::InvalidByteSequenceError happens when source bytes are invalid for their declared encoding.

# UndefinedConversionError - character doesn't exist in target encoding
utf8_str = "café"
begin
  ascii_str = utf8_str.encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  puts "Cannot convert: #{e.error_char}"     # => "Cannot convert: é"
  puts "Source encoding: #{e.source_encoding}" # => UTF-8
  puts "Target encoding: #{e.destination_encoding}" # => US-ASCII
end
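
Encoding::InvalidByteSequenceError carries similar diagnostics. A sketch that triggers it with a broken UTF-8 input:

broken_utf8 = "caf\xFF".force_encoding('UTF-8')
begin
  broken_utf8.encode('UTF-16')
rescue Encoding::InvalidByteSequenceError => e
  puts "Invalid bytes: #{e.error_bytes.inspect}"  # => "\xFF"
  puts "Source encoding: #{e.source_encoding}"    # => UTF-8
end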

The encode method accepts options to handle conversion errors gracefully. The :invalid and :undef options specify replacement strategies, while :replace sets the replacement string.

problematic = "café\xFF"        # Mixed valid UTF-8 and invalid byte
# Handle both undefined conversion and invalid bytes
safe_ascii = problematic.encode('ASCII', 
  :invalid => :replace,         # Replace invalid bytes
  :undef => :replace,           # Replace undefined characters  
  :replace => '?')              # Use '?' as replacement
# => "caf??"

# XML-safe replacements
xml_safe = problematic.encode('ASCII',
  :invalid => :replace,
  :undef => :replace, 
  :replace => '&#xFFFD;')       # Unicode replacement character entity

Debugging encoding issues requires examining both the byte-level representation and the logical character structure. The String#bytes method reveals the actual byte sequence, while String#codepoints shows Unicode values.

suspicious = "caf\xE9"          # Suspicious string
suspicious.encoding             # => #<Encoding:UTF-8> (default)
suspicious.valid_encoding?      # => false

# Examine the bytes
suspicious.bytes                # => [99, 97, 102, 233]
# 233 (0xE9) is invalid UTF-8 sequence start

# Try different encoding interpretation
latin1_version = suspicious.force_encoding('ISO-8859-1')
latin1_version.valid_encoding?  # => true
latin1_version                  # => "café" (correct interpretation)

# Convert to proper UTF-8
fixed = latin1_version.encode('UTF-8')
fixed.bytes                     # => [99, 97, 102, 195, 169] (valid UTF-8)

The String#scrub method removes or replaces invalid byte sequences, providing a robust way to clean untrusted input. This method works regardless of the specific encoding errors present.

# String with multiple encoding problems
broken = "good\xFF\xFEbad\x80text"
broken.valid_encoding?          # => false

# Remove invalid sequences
cleaned = broken.scrub
cleaned                         # => "goodbadtext"

# Custom replacement
cleaned_custom = broken.scrub('') 
cleaned_custom                  # => "good�bad�text"

# Block form for complex replacement logic
smart_clean = broken.scrub do |invalid_bytes|
  invalid_bytes.unpack('H*')[0] # Show hex representation
end
smart_clean                     # => "goodfffebadjpegtext"

File encoding detection requires heuristic analysis since most file formats don't include explicit encoding metadata. Gems such as rchardet or charlock_holmes provide sophisticated detection, but simple patterns often suffice for known data sources.

# Simple BOM detection for UTF files
def detect_encoding(file_path)
  bytes = File.binread(file_path, 3).to_s
  if bytes.start_with?("\xEF\xBB\xBF".b)    # UTF-8 BOM
    'UTF-8'
  elsif bytes.start_with?("\xFF\xFE".b)     # UTF-16LE BOM
    'UTF-16LE'
  elsif bytes.start_with?("\xFE\xFF".b)     # UTF-16BE BOM
    'UTF-16BE'
  else
    'UTF-8'                                 # Assume UTF-8 when no BOM is present
  end
end
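
A typical use, with 'data.txt' standing in for any incoming file:

content = File.read('data.txt', encoding: detect_encoding('data.txt'))
content.encoding                # reflects the detected encoding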

Performance & Memory

Encoding conversions carry significant performance costs, especially for large strings. Converting UTF-8 to ASCII requires scanning every byte, and converting to or from UTF-16 rewrites the byte representation of every character. Avoiding unnecessary conversions improves both speed and memory usage.

# Performance comparison setup
require 'benchmark'

large_text = "café" * 100_000   # 400KB UTF-8 string
iterations = 1000

Benchmark.bm(20) do |x|
  x.report("force_encoding:") do
    # dup so the shared string keeps its UTF-8 label between runs
    iterations.times { large_text.dup.force_encoding('ASCII-8BIT') }
  end

  x.report("encode ASCII:") do
    # é is valid UTF-8 but undefined in ASCII, so handle :undef
    iterations.times { large_text.encode('ASCII', :undef => :replace) }
  end

  x.report("encode UTF-16:") do
    iterations.times { large_text.encode('UTF-16') }
  end
end

# Results show force_encoding is nearly free (just metadata change)
# while encode operations involve actual byte transformation

Memory allocation patterns differ significantly between encoding operations. force_encoding changes only metadata without allocating new strings, while encode always creates new string objects with potentially different byte lengths.

original = "café" * 10_000
puts "Original: #{original.bytesize} bytes"

# Force encoding on a copy - changes only the label, no byte copying
forced = original.dup.force_encoding('ASCII-8BIT')
puts "Forced: #{forced.bytesize} bytes"         # Same byte count

# Encoding conversion - new allocation required
utf16_version = original.encode('UTF-16')
puts "UTF-16: #{utf16_version.bytesize} bytes"  # Larger (BOM plus 2 bytes per character)

ascii_version = original.encode('ASCII', :undef => :replace)
puts "ASCII: #{ascii_version.bytesize} bytes"   # Smaller (each é collapses to '?')

String operations on encoded text show varying performance characteristics. Character-based operations like String#size require encoding-aware scanning, while byte-based operations like String#bytesize access cached metadata.

text = "résumé" * 50_000        # Mixed ASCII/UTF-8 content

Benchmark.bm(15) do |x|
  x.report("bytesize:") do
    100_000.times { text.bytesize }     # O(1) - metadata lookup
  end
  
  x.report("size:") do  
    100_000.times { text.size }         # O(n) - character counting
  end
  
  x.report("valid?:") do
    10_000.times { text.valid_encoding? } # O(n) - full validation scan
  end
end
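
ASCII-only strings sidestep most of this cost: Ruby tracks a 7-bit flag internally, so character counting collapses to the cached byte count. A small sketch:

ascii_text = "resume" * 50_000
ascii_text.ascii_only?          # => true
ascii_text.size                 # O(1) - equals bytesize for ASCII-only strings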

Bulk processing benefits from encoding normalization strategies that minimize conversion overhead. Converting all inputs to a common encoding once, rather than converting repeatedly during processing, reduces total computational cost.

# Inefficient: convert during each operation
def process_files_inefficient(file_paths)
  results = []
  file_paths.each do |path|
    content = File.read(path, encoding: 'ASCII-8BIT')  # Read as binary
    utf8_content = content.encode('UTF-8', :invalid => :replace, :undef => :replace)
    processed = utf8_content.upcase.gsub(/\s+/, ' ')   # Multiple operations
    results << processed.encode('ASCII', :invalid => :replace, :undef => :replace)
  end
  results
end

# Efficient: batch convert once
def process_files_efficient(file_paths) 
  # Read and normalize all files first
  normalized_contents = file_paths.map do |path|
    content = File.read(path, encoding: 'ASCII-8BIT')
    content.encode('UTF-8', :invalid => :replace, :undef => :replace)
  end
  
  # Process in consistent encoding
  results = normalized_contents.map do |content|
    content.upcase.gsub(/\s+/, ' ')
  end
  
  # Convert outputs once if needed
  results.map { |r| r.encode('ASCII', :invalid => :replace, :undef => :replace) }
end

Common Pitfalls

Encoding assumption errors represent the most frequent category of string encoding problems. Developers often assume UTF-8 encoding for external data, but file systems, databases, and network protocols may use different encodings without explicit indication.

# Dangerous assumption - file might not be UTF-8
def read_user_file_wrong(path)
  content = File.read(path)       # Assumes UTF-8
  content.upcase                  # May fail with invalid UTF-8
end

# Safe approach with encoding detection/handling  
def read_user_file_safe(path)
  # Read as binary first to inspect bytes
  binary_content = File.read(path, encoding: 'ASCII-8BIT')
  
  # Try UTF-8 first
  if binary_content.force_encoding('UTF-8').valid_encoding?
    return binary_content.force_encoding('UTF-8')
  end
  
  # Fallback to Latin-1 (can represent any byte)
  binary_content.force_encoding('ISO-8859-1').encode('UTF-8')
end

The force_encoding vs encode confusion leads to subtle bugs where strings appear correct but contain invalid byte sequences. force_encoding changes interpretation without validating compatibility, while encode performs actual conversion.

# Common mistake: using force_encoding for conversion
latin1_bytes = "café".encode('ISO-8859-1')
# Wrong way - just changes label, doesn't convert bytes
wrong_utf8 = latin1_bytes.force_encoding('UTF-8')  
wrong_utf8.valid_encoding?      # => false! (invalid UTF-8 bytes)

# Correct way - actually converts byte sequences  
right_utf8 = latin1_bytes.encode('UTF-8')
right_utf8.valid_encoding?      # => true

Regular expression encoding compatibility creates unexpected match failures. Patterns compiled with one encoding may fail to match strings with different encodings, even when the content appears identical.

# Pattern containing only ASCII characters compiles as US-ASCII
ascii_pattern = /caf/
ascii_pattern.encoding          # => #<Encoding:US-ASCII>

utf8_text = "I love café food"  
utf8_text.encoding              # => #<Encoding:UTF-8>

# This works due to ASCII compatibility promotion
result = utf8_text =~ ascii_pattern  # => 7

# But a pattern holding raw non-ASCII bytes in another encoding is incompatible
latin1_pattern = Regexp.new("caf\xE9".force_encoding('ISO-8859-1'))
utf8_text =~ latin1_pattern          # raises Encoding::CompatibilityError

# Solution: build the pattern from a string in the text's encoding
compatible_pattern = Regexp.new('café'.encode(utf8_text.encoding))
utf8_text =~ compatible_pattern      # => 7

JSON and XML parsing libraries exhibit inconsistent encoding behavior. Some libraries respect source encoding, others force UTF-8, and many provide configuration options that change default behavior between versions.

require 'json'

# JSON standard requires UTF-8, but input might arrive in another encoding
latin1_json = '{"name": "café"}'.encode('ISO-8859-1')

# This might fail depending on JSON library version and configuration
begin
  # Some JSON parsers expect UTF-8 only
  parsed = JSON.parse(latin1_json)
rescue JSON::ParserError, EncodingError
  # Convert to UTF-8 first
  parsed = JSON.parse(latin1_json.encode('UTF-8'))
end

Database encoding mismatches cause data corruption that persists across application restarts. Character data stored with incorrect encoding interpretation becomes permanently corrupted unless detected and repaired quickly.

# Simulate database encoding mismatch
class DatabaseSimulator
  def initialize
    @storage = []
  end
  
  # Wrong: stores UTF-8 bytes as Latin-1
  def store_wrong(text)
    # Application sends UTF-8, database interprets as Latin-1
    stored_bytes = text.encode('UTF-8').force_encoding('ISO-8859-1')
    @storage << stored_bytes
    stored_bytes
  end
  
  # Retrieval compounds the problem
  def retrieve_wrong(index)
    stored = @storage[index]
    # Database transcodes its "Latin-1" data to UTF-8 on the way out,
    # double-encoding the original UTF-8 bytes
    stored.encode('UTF-8')            # => mojibake like "cafÃ©"
  end
  
  # Correct approach - consistent encoding throughout
  def store_correct(text)
    utf8_text = text.encode('UTF-8')  # Normalize input
    @storage << utf8_text
    utf8_text
  end
  
  def retrieve_correct(index)
    @storage[index]                   # Already UTF-8
  end
end

# Demonstration of corruption
db = DatabaseSimulator.new
original = "café"

# Wrong way corrupts data
db.store_wrong(original)
retrieved_wrong = db.retrieve_wrong(0)
retrieved_wrong                       # => "cafÃ©" (valid UTF-8, wrong characters)
retrieved_wrong == original           # => false

# Correct way preserves data
db.store_correct(original)  
retrieved_correct = db.retrieve_correct(1)
retrieved_correct.valid_encoding?     # => true
retrieved_correct == original         # => true

Reference

Core Encoding Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| String#encoding | none | Encoding | Returns the encoding object for the string |
| String#force_encoding(encoding) | encoding (String/Encoding) | String | Changes the encoding label without converting bytes |
| String#encode(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | Converts the string to the specified encoding |
| String#encode!(encoding, **opts) | encoding (String/Encoding), options (Hash) | String | In-place encoding conversion |
| String#valid_encoding? | none | Boolean | Tests whether the string contains only valid byte sequences |
| String#ascii_only? | none | Boolean | Tests whether the string contains only ASCII characters |
| String#scrub(replace = nil) | replace (String), optional block | String | Removes or replaces invalid byte sequences |

Encoding Information Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| String#bytesize | none | Integer | Returns the byte count |
| String#size / String#length | none | Integer | Returns the character count |
| String#bytes | none | Array<Integer> | Returns an array of byte values |
| String#codepoints | none | Array<Integer> | Returns an array of Unicode code points |
| String#each_byte | optional block | Enumerator/self | Iterates over each byte |
| String#each_codepoint | optional block | Enumerator/self | Iterates over each code point |

Encoding Class Methods

| Method | Parameters | Returns | Description |
| --- | --- | --- | --- |
| Encoding.list | none | Array<Encoding> | Returns all available encodings |
| Encoding.find(name) | name (String) | Encoding | Finds an encoding by name or alias |
| Encoding.default_external | none | Encoding | Returns the default external encoding |
| Encoding.default_internal | none | Encoding/nil | Returns the default internal encoding |
| Encoding.compatible?(obj1, obj2) | two objects | Encoding/nil | Returns the compatible encoding, or nil |

Conversion Options

| Option | Values | Description |
| --- | --- | --- |
| :invalid | :replace (nil raises) | How to handle invalid byte sequences |
| :undef | :replace (nil raises) | How to handle undefined character conversions |
| :replace | String | Replacement string for invalid/undefined characters (default '?', or U+FFFD for Unicode targets) |
| :fallback | Hash/Proc | Character-specific replacement mappings |
| :xml | :text, :attr | XML-safe replacement mode |
| :cr_newline | Boolean | Convert LF to CR during transcoding |
| :crlf_newline | Boolean | Convert LF to CRLF during transcoding |
| :universal_newline | Boolean | Convert CRLF and CR to LF during transcoding |

Common Encoding Names

| Encoding | Aliases | Byte Width | Description |
| --- | --- | --- | --- |
| UTF-8 | CP65001 | 1-4 bytes | Unicode, web standard |
| UTF-16 | (UTF-16BE/UTF-16LE are separate encodings) | 2 or 4 bytes | Unicode, common on Windows |
| US-ASCII | ASCII | 1 byte | 7-bit ASCII characters only |
| ASCII-8BIT | BINARY | 1 byte | Binary data, no character interpretation |
| ISO-8859-1 | ISO8859-1 ("Latin-1") | 1 byte | Western European languages |
| Windows-1252 | CP1252 | 1 byte | Windows Western European |
| Shift_JIS | SJIS | 1-2 bytes | Japanese character encoding |

Exception Hierarchy

EncodingError
├── Encoding::UndefinedConversionError
├── Encoding::InvalidByteSequenceError  
├── Encoding::ConverterNotFoundError
└── Encoding::CompatibilityError
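
Because the specific subclasses all inherit from EncodingError, a single rescue can act as a catch-all when the precise failure mode doesn't matter:

begin
  "café".encode('ASCII')
rescue EncodingError => e
  e.class   # => Encoding::UndefinedConversionError
end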

Encoding Detection Patterns

| Byte Sequence | Encoding | Location |
| --- | --- | --- |
| EF BB BF | UTF-8 | BOM at start |
| FF FE | UTF-16LE | BOM at start |
| FE FF | UTF-16BE | BOM at start |
| 00 xx 00 xx | UTF-16BE | Pattern in text |
| xx 00 xx 00 | UTF-16LE | Pattern in text |
| All bytes < 0x80 | ASCII | Throughout |
| Valid UTF-8 sequences | UTF-8 | Heuristic analysis |

Performance Characteristics

| Operation | Time Complexity | Notes |
| --- | --- | --- |
| String#bytesize | O(1) | Cached metadata |
| String#size | O(n) | Must count characters (O(1) for ASCII-only strings) |
| String#valid_encoding? | O(n) | Full scan; result cached until the string is modified |
| String#force_encoding | O(1) | Metadata change only |
| String#encode | O(n) | Byte transformation required |
| String#scrub | O(n) | Scans and potentially copies |